Approximating Sequence Similarity Brad Solomon

March 27, 2019 Lecture 16: Scalable Methods for Genomics Approximating Sequence Similarity Brad Solomon With Special Focus on RNA-Seq

March 27, 2019 Lecture 16: Scalable Methods for Genomics A Quick Recap

Overlap between two sequences

overlap (19 bases) overhang (6 bases) …AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT… overhang % identity = 18/19 % = 94.7% RNA-seq Challenges overlap - region of similarity between regions overhang - un-aligned ends of the sequences Challenge 1: Eukaryotic genes are spliced The assembler screens merges based on: Solution: Use a spliced aligner, and assemble isoforms • length of overlap TopHat: discovering spliced junctions with RNA-Seq. • % identity in overlap region Trapnell et al (2009) Bioinformatics. 25:0 1105-1111 • maximum overhang size.

[How do we compute the overlap?] Challenge 2: Read Count != Transcript abundance Solution: Infer underlying abundances (e.g. TPM) [Do we really want to do all-vs-all?] Transcript assembly and quantification by RNA-seq Trapnell et al (2010) Nat. Biotech. 25(5): 511-515 See Lecture 4 Assembly & Whole Genome Alignment Challenge 3: Transcript abundances are stochastic Solution: Replicates, replicates, and more replicates

RNA-seq differential expression studies: more sequence or more replication? Liu et al (2013) Bioinformatics. doi:10.1093/bioinformatics/btt688

!2 See Lecture 10 RNA Sequencing Sequence similarity at scale is a ‘big’ problem Open Access Data (Gigabases)

3 Sequence similarity at scale is a ‘big’ problem

The exponential growth of sequence data is a “Big Data” problem Open Access Data (Gigabases)

3 Biological questions at scale

!4 Biological questions at scale

Experiment Discovery I have sample X, find me related samples

!4 Biological questions at scale

Experiment Discovery I have sample X, find me related samples

Containment Query I am interested in transcript X, find me studies expressing it

!4 Biological questions at scale

Experiment Discovery I have sample X, find me related samples

Containment Query I am interested in transcript X, find me studies expressing it

Cardinality Estimation

I have N studies, how many unique sequences are present?

!4 Biological questions at scale

Experiment Discovery I have sample X, find me related samples

Containment Query I am interested in transcript X, find me studies expressing it

Cardinality Estimation

I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies

!4 Biological questions at scale

Experiment Discovery I have sample X, find me related samples

Containment Query I am interested in transcript X, find me studies expressing it

Cardinality Estimation

I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies

How can we solve these problems?

!4 Biological questions at scale

Experiment Discovery I have sample X, find me related samples

Containment Query I am interested in transcript X, find me studies expressing it

Cardinality Estimation

I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies

How can we solve these problems?

!5 Biological questions at scale

Experiment Discovery Measure similarity as alignment overlap I have sample X, find me related samples

Containment Query I am interested in transcript X, find me studies expressing it

Cardinality Estimation

I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies

How can we solve these problems?

!5 Biological questions at scale

Experiment Discovery Measure similarity as alignment overlap I have sample X, find me related samples

Containment Query Align each study to a reference, observe X I am interested in transcript X, find me studies expressing it

Cardinality Estimation

I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies

How can we solve these problems?

!5 Biological questions at scale

Experiment Discovery Measure similarity as alignment overlap I have sample X, find me related samples

Containment Query Align each study to a reference, observe X I am interested in transcript X, find me studies expressing it

Align each study to a reference, observe Cardinality Estimation I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies

How can we solve these problems?

!5 Biological questions at scale

Experiment Discovery Measure similarity as alignment overlap I have sample X, find me related samples

Containment Query Align each study to a reference, observe X I am interested in transcript X, find me studies expressing it

Align each study to a reference, observe Cardinality Estimation I have N studies, how many unique sequences are present?

Align each study to a reference, normalize and observe Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies

How can we solve these problems?

!5 Methods for alignment at scale

ExAC Browser Beta

!6 Methods for alignment at scale

ExAC Browser Beta

!6 Methods for alignment at scale

ExAC Browser Beta

!6 Rail-RNA: Bulk Alignment

Given a of N sequencing studies,

1) Aggregate the reads into sets of overlapping reads and readlets (a subsequence of a set of reads with partial overlap)

2) Perform parallel alignment where each thread is given distinct reads based on nucleotide similarity

3) Second pass alignment on all non-perfect or tied score alignments adds quality scores to break ties. Additional readlets are generated for poor alignments

4) Readlets are parallel aligned based on nucleotide similarity

5) Further steps are taken to identify exon-exon junctions and one final alignment is performed using the bulk exon data identified from the studies

6) All individual alignments are output, with a single primary alignment compiled for each individual study

Rail-RNA: scalable analysis of RNA-seq splicing and coverage Nellore et al (2017) Bioinformatics Rail-RNA: Bulk Alignment

Given a set of N sequencing studies,

1) Aggregate the reads into sets of overlapping reads and readlets (a subsequence of a set of reads with partial overlap)

2) Perform parallel alignment where each thread is given distinct reads based on nucleotide similarity

3) Second pass alignment on all non-perfect or tied score alignments adds quality scores to break ties. Additional readlets are generated for poor alignments

4) Readlets are parallel aligned based on nucleotide similarity

5) Further steps are taken to identify exon-exon junctions and one final alignment is performed using the bulk exon data identified from the studies

6) All individual alignments are output, with a single primary alignment compiled for each individual study

Rail-RNA: scalable analysis of RNA-seq splicing and coverage Nellore et al (2017) Bioinformatics Rail-RNA: Bulk Alignment

Given a set of N sequencing studies,

1) Aggregate the reads into sets of overlapping reads and readlets (a subsequence of a set of reads with partial overlap)

2) Perform parallel alignment where each thread is given distinct reads based on nucleotide similarity

3) Second pass alignment on all non-perfect or tied score alignments adds quality scores to break ties. Additional readlets are generated for poor alignments

4) Readlets are parallel aligned based on nucleotide similarity

5) Further steps are taken to identify exon-exon junctions and one final alignment is performed using the bulk exon data identified from the studies

6) All individual alignments are output, with a single primary alignment compiled for each individual study

Rail-RNA: scalable analysis of RNA-seq splicing and coverage Nellore et al (2017) Bioinformatics Recount2: Combining methods

1) A collection of studies is selected 2) Rail-RNA was run on cloud computing services in batches

48,558 samples from SRA 11,350 samples from TCGA 9,662 samples from GTEx 4) The resulting data was made publicly available BigWig

3) Each alignment is stored as a BigWig (a dense data storage structure) Reproducible RNA-seq analysis using recount2 Collado-Torres et al (2017) Nature Biotechnology Recount2: Combining methods

1) A collection of studies is selected 2) Rail-RNA was run on cloud computing services in batches

48,558 samples from SRA 11,350 samples from TCGA 9,662 samples from GTEx 4) The resulting data was made publicly available How were these selected? BigWig

3) Each alignment is stored as a BigWig (a dense data storage structure) Reproducible RNA-seq analysis using recount2 Collado-Torres et al (2017) Nature Biotechnology Recount2: Combining methods

How were the batches selected? 1) A collection of studies is selected 2) Rail-RNA was run on cloud computing services in batches

48,558 samples from SRA 11,350 samples from TCGA 9,662 samples from GTEx 4) The resulting data was made publicly available How were these selected? BigWig

3) Each alignment is stored as a BigWig (a dense data storage structure) Reproducible RNA-seq analysis using recount2 Collado-Torres et al (2017) Nature Biotechnology Recount2: Combining methods

How were the batches selected? 1) A collection of studies is selected 2) Rail-RNA was run on cloud computing services in batches

48,558 samples from SRA How is this designed? 11,350 samples from TCGA 9,662 samples from GTEx 4) The resulting data was made publicly available How were these selected? BigWig

3) Each alignment is stored as a BigWig (a dense data storage structure) Reproducible RNA-seq analysis using recount2 Collado-Torres et al (2017) Nature Biotechnology Alignment can’t scale to Big Data

Recount2 (2017) Open Access Data (Gigabases)

9 Alignment can’t scale to Big Data

Recount2 (2017) Recount2 processed 70,603 samples Open Access Data (Gigabases)

9 Alignment can’t scale to Big Data

The SRA has 4,782,578 samples

Recount2 (2017) Recount2 processed 70,603 samples Open Access Data (Gigabases)

9 Recap: Sailfish BRIEF COMMUNICATIONS 1) Indexing Parse reference transcriptome into kmers Figure •1 Overview of the Sailfish pipeline. (a,b) Sailfish consists of an Reference transcripts indexing phase (a) that is invoked by the command ‘sailfish index’ and a quantification phase (b) invoked by the command ‘sailfish quant’. All unique kmers are hashed and counted Read data The Sailfish• index has four components: (1) a perfect mapping a Index each k-mer in the transcript set to a unique integer between 0 and N−1, (per transcriptome & (1) where N is the number of unique k-mers in the set of transcripts; (2) an array choice of k) Array of k-mer counts (2) recording• theTwo number indices of times each for k-mer bi-directional occurs in the reference mapping set; (3) an of index mapping each transcript to the multiset of k-mers that it contains; Transcript k-mer transcripts and kmers to contained to containing (4) an index mapping each k-mer to the set of transcripts in which it k-mers transcripts appears. The quantification phase consists of counting the indexed k-mers (3) (4) in the set of reads and then applying an EM procedure to determine the maximum-likelihood estimates of relative transcript abundance. K-mer count b Quant assignments are illustrated by vertical gray bars on lines representing known Intersect reads with Unhashable (per set of reads) transcripts; the horizontal lines intersecting the gray bars represent the indexed k-mers k-mers average of the current k-mer count assignments for each transcript. Hashable k-mers may be more affected by errors in the reads (Supplementary Fig. 1). Aggregate counts Further, many of our data structures can be represented as arrays of atomic integers (Online Methods). This allows our software to Estimate transcript be concurrent and lock-free where possible, leading to an approach abundances from allocated k-mers that scales well with the number of available CPUs (Supplementary Fig. 2). Additional benefits of the Sailfish approach are discussed in

Supplementary Note 1. … … Sailfish works in two phases: indexing and quantification (Fig. 1). A Sailfish index is built from a particular set of reference transcripts Transcripts with Reallocate Average k-mer current k-mer k-mers based coverage (a FASTA sequence file) and a specific choice of k-mer length, k. allocations on abundance The index consists of data structures that make counting k-mers in estimates a set!10 of reads and resolving their potential origin in the set of tran- scripts efficient (Online Methods). The most important data struc- needs to be rebuilt only when the set of reference transcripts or the ture in the index is the minimal perfect hash function9 that maps choice of k changes. each k-mer in the reference transcripts to an integer identifier such The quantification phase of Sailfish takes as input the above that no two k-mers share an identifier. Pairing the minimum perfect index and a set of RNA-seq reads and produces an estimate of the hash function with an atomically updateable array of k-mer counts relative abundance of each transcript in the reference, measured allows k-mers to be counted even faster than with existing advanced in reads per kilobase per million mapped reads (RPKM), k-mers lock-free hashes such as that used in Jellyfish10. The index also con- per kilobase per million mapped k-mers (KPKM) and transcripts tains a pair of look-up tables that allow fast access to the k-mers per million (TPM) (see Online Methods for the definitions of these appearing in a specific transcript as well as the transcripts in which a measures). First, Sailfish counts the number of times each indexed particular k-mer appears, both in amortized constant time. The index k-mer occurs in the set of reads. Owing to the index, this process is efficient and scalable (Supplementary Fig. 2). Sailfish then applies an EM procedure to determine maximum likelihood estimates a c 4

) 8 10 Alignment 2 Figure 2 Speed and accuracy of Sailfish. (a) The correlation between qPCR Quantification 6 estimates of gene abundance (x axis) and the estimates of Sailfish. The 4 qPCR results are taken from the microarray quality control study (MAQC)15. 2 18.55 h 0 103 The results shown here are for the human brain tissue, and the RNA-seq– 10.19 h 7.70 h –2 6.03 h 6.96 h based estimates were computed using the reads from SRA accession –4

Sailfish KPKM(log SRX016366 (81,250,481 35bp single-end reads). The set of transcripts –6 2.27 h used in this experiment were the curated RefSeq20 transcripts (accession 0 2 4 2 –8 –6 –4 –2 10 –10 prefix NM) from hg18 (31,148 transcripts). (b) The correlation between qPCR(log2) Time (min) the ground truth FPKM in a simulated data set (x axis) and the abundance b 0.26 h estimates of Sailfish. The quantification in this experiment was performed ) 2 10 1 on a set of 96,520 transcript sequences taken from Ensembl21 GRCh37.73. 10 0.09 h (c) The total time taken by each method, Sailfish, RSEM, eXpress and 5 Cufflinks, to estimate isoform abundance on each data set. The total time 0 taken by a method is the height of the corresponding bar, and the total is 0 10 further broken down into the time taken to perform read-alignment (for

Sailfish KPKM(log –5 press Sailfish, we instead measured the time taken to count the k-mers in the Sailfish RSEM Sailfish RSEM eX Cufflinks eXpressCufflinks –5 0 5 10 read set) and the time taken to quantify abundance given the aligned SRX016366 Synthetic reads (or k-mer counts). All tools were run in multithreaded mode (where True FPKM(log2) applicable) and were allowed to use up to 16 threads. (d) Accuracy of each d of the methods on human brain tissue and a synthetic data set. Accuracy Human brain tissue Synthetic is measured by the Pearson (log-transformed) and Spearman correlation Sailfish RSEM eXpress CufflinksSailfish RSEM eXpress Cufflinks coefficients between estimated abundance values and MAQC qPCR data (for Pearson 0.85 0.82 0.85 0.85 0.96 0.96 0.95 0.94 human brain tissue) or ground truth (for simulated data). Root-mean-square Spearman 0.84 0.80 0.85 0.85 0.76 0.77 0.77 0.75 RMSE – – – – 6.06 8.90 8.83 10.05 error (RMSE) and median percentage error (medPE) are calculated medPE – – – – 6.50 12.48 14.06 12.42 as described in the Online Methods.

NATURE BIOTECHNOLOGY VOLUME 32 NUMBER 5 MAY 2014 463 Recap: Sailfish BRIEF COMMUNICATIONS 1) Indexing Parse reference transcriptome into kmers Figure •1 Overview of the Sailfish pipeline. (a,b) Sailfish consists of an Reference transcripts indexing phase (a) that is invoked by the command ‘sailfish index’ and a quantification phase (b) invoked by the command ‘sailfish quant’. All unique kmers are hashed and counted Read data The Sailfish• index has four components: (1) a perfect hash function mapping a Index each k-mer in the transcript set to a unique integer between 0 and N−1, (per transcriptome & Perfect hash function (1) where N is the number of unique k-mers in the set of transcripts; (2) an array choice of k) Array of k-mer counts (2) recording• theTwo number indices of times each for k-mer bi-directional occurs in the reference mapping set; (3) an of index mapping each transcript to the multiset of k-mers that it contains; Transcript k-mer transcripts and kmers to contained to containing (4) an index mapping each k-mer to the set of transcripts in which it k-mers transcripts appears. The quantification phase consists of counting the indexed k-mers (3) (4) in the set of reads and then applying an EM procedure to determine the maximum-likelihood estimates of relative transcript abundance. K-mer count b Quant assignments are illustrated by vertical gray bars on lines representing known Intersect reads with Unhashable (per set of reads) transcripts; the horizontal lines intersecting the gray bars represent the indexed k-mers k-mers average of the current k-mer count assignments for each transcript. 2) Quantification Hashable k-mers • Count kmers in read set may be more affected by errors in the reads (Supplementary Fig. 1). Aggregate counts Further, many of our data structures can be represented as arrays of atomic• Use integers EM (Online procedure Methods). Thisto estimate allows our software transcript to Estimate transcript be concurrent and lock-free where possible, leading to an approach abundances from abundances, repeating as necessary allocated k-mers that scales well with the number of available CPUs (Supplementary Fig. 2). Additional benefits of the Sailfish approach are discussed in

Supplementary Note 1. … … Sailfish works in two phases: indexing and quantification (Fig. 1). A Sailfish index is built from a particular set of reference transcripts Transcripts with Reallocate Average k-mer current k-mer k-mers based coverage (a FASTA sequence file) and a specific choice of k-mer length, k. allocations on abundance The index consists of data structures that make counting k-mers in estimates a set!10 of reads and resolving their potential origin in the set of tran- scripts efficient (Online Methods). The most important data struc- needs to be rebuilt only when the set of reference transcripts or the ture in the index is the minimal perfect hash function9 that maps choice of k changes. each k-mer in the reference transcripts to an integer identifier such The quantification phase of Sailfish takes as input the above that no two k-mers share an identifier. Pairing the minimum perfect index and a set of RNA-seq reads and produces an estimate of the hash function with an atomically updateable array of k-mer counts relative abundance of each transcript in the reference, measured allows k-mers to be counted even faster than with existing advanced in reads per kilobase per million mapped reads (RPKM), k-mers lock-free hashes such as that used in Jellyfish10. The index also con- per kilobase per million mapped k-mers (KPKM) and transcripts tains a pair of look-up tables that allow fast access to the k-mers per million (TPM) (see Online Methods for the definitions of these appearing in a specific transcript as well as the transcripts in which a measures). First, Sailfish counts the number of times each indexed particular k-mer appears, both in amortized constant time. The index k-mer occurs in the set of reads. Owing to the index, this process is efficient and scalable (Supplementary Fig. 2). Sailfish then applies an EM procedure to determine maximum likelihood estimates a c 4

) 8 10 Alignment 2 Figure 2 Speed and accuracy of Sailfish. (a) The correlation between qPCR Quantification 6 estimates of gene abundance (x axis) and the estimates of Sailfish. The 4 qPCR results are taken from the microarray quality control study (MAQC)15. 2 18.55 h 0 103 The results shown here are for the human brain tissue, and the RNA-seq– 10.19 h 7.70 h –2 6.03 h 6.96 h based estimates were computed using the reads from SRA accession –4

Sailfish KPKM(log SRX016366 (81,250,481 35bp single-end reads). The set of transcripts –6 2.27 h used in this experiment were the curated RefSeq20 transcripts (accession 0 2 4 2 –8 –6 –4 –2 10 –10 prefix NM) from hg18 (31,148 transcripts). (b) The correlation between qPCR(log2) Time (min) the ground truth FPKM in a simulated data set (x axis) and the abundance b 0.26 h estimates of Sailfish. The quantification in this experiment was performed ) 2 10 1 on a set of 96,520 transcript sequences taken from Ensembl21 GRCh37.73. 10 0.09 h (c) The total time taken by each method, Sailfish, RSEM, eXpress and 5 Cufflinks, to estimate isoform abundance on each data set. The total time 0 taken by a method is the height of the corresponding bar, and the total is 0 10 further broken down into the time taken to perform read-alignment (for

Sailfish KPKM(log –5 press Sailfish, we instead measured the time taken to count the k-mers in the Sailfish RSEM Sailfish RSEM eX Cufflinks eXpressCufflinks –5 0 5 10 read set) and the time taken to quantify abundance given the aligned SRX016366 Synthetic reads (or k-mer counts). All tools were run in multithreaded mode (where True FPKM(log2) applicable) and were allowed to use up to 16 threads. (d) Accuracy of each d of the methods on human brain tissue and a synthetic data set. Accuracy Human brain tissue Synthetic is measured by the Pearson (log-transformed) and Spearman correlation Sailfish RSEM eXpress CufflinksSailfish RSEM eXpress Cufflinks coefficients between estimated abundance values and MAQC qPCR data (for Pearson 0.85 0.82 0.85 0.85 0.96 0.96 0.95 0.94 human brain tissue) or ground truth (for simulated data). Root-mean-square Spearman 0.84 0.80 0.85 0.85 0.76 0.77 0.77 0.75 RMSE – – – – 6.06 8.90 8.83 10.05 error (RMSE) and median percentage error (medPE) are calculated medPE – – – – 6.50 12.48 14.06 12.42 as described in the Online Methods.

NATURE BIOTECHNOLOGY VOLUME 32 NUMBER 5 MAY 2014 463 Recap: Sailfish BRIEF COMMUNICATIONS 1) Indexing Parse reference transcriptome into kmers Figure •1 Overview of the Sailfish pipeline. (a,b) Sailfish consists of an Reference transcripts indexing phase (a) that is invoked by the command ‘sailfish index’ and a quantification phase (b) invoked by the command ‘sailfish quant’. All unique kmers are hashed and counted Read data The Sailfish• index has four components: (1) a perfect hash function mapping a Index each k-mer in the transcript set to a unique integer between 0 and N−1, (per transcriptome & Perfect hash function (1) where N is the number of unique k-mers in the set of transcripts; (2) an array choice of k) Array of k-mer counts (2) recording• theTwo number indices of times each for k-mer bi-directional occurs in the reference mapping set; (3) an of index mapping each transcript to the multiset of k-mers that it contains; Transcript k-mer transcripts and kmers to contained to containing (4) an index mapping each k-mer to the set of transcripts in which it k-mers transcripts appears. The quantification phase consists of counting the indexed k-mers (3) (4) in the set of reads and then applying an EM procedure to determine the maximum-likelihood estimates of relative transcript abundance. K-mer count b Quant assignments are illustrated by vertical gray bars on lines representing known Intersect reads with Unhashable (per set of reads) transcripts; the horizontal lines intersecting the gray bars represent the indexed k-mers k-mers average of the current k-mer count assignments for each transcript. 2) Quantification Hashable k-mers • Count kmers in read set may be more affected by errors in the reads (Supplementary Fig. 1). Aggregate counts Further, many of our data structures can be represented as arrays of atomic• Use integers EM (Online procedure Methods). Thisto estimate allows our software transcript to Estimate transcript be concurrent and lock-free where possible, leading to an approach abundances from abundances, repeating as necessary allocated k-mers that scales well with the number of available CPUs (Supplementary Fig. 2). Additional benefits of the Sailfish approach are discussed in

Supplementary Note 1. … … Sailfish works in two phases: indexing and quantification (Fig. 1). A Sailfish index is built from a particular set of reference transcripts Transcripts with Reallocate Average k-mer current k-mer k-mers based coverage (a FASTA sequence file) and a specific choice of k-mer length, k. allocations on abundance The index consists of data structures that make counting k-mers in estimates a set!10 of reads and resolvingSailfish, their potential Salmon, origin in the and set ofKallisto tran- all operate on a per-sample basis scripts efficient (Online Methods). The most important data struc- needs to be rebuilt only when the set of reference transcripts or the ture in the index is the minimal perfect hash function9 that maps choice of k changes. each k-mer in the reference transcripts to an integer identifier such The quantification phase of Sailfish takes as input the above that no two k-mers share an identifier. Pairing the minimum perfect index and a set of RNA-seq reads and produces an estimate of the hash function with an atomically updateable array of k-mer counts relative abundance of each transcript in the reference, measured allows k-mers to be counted even faster than with existing advanced in reads per kilobase per million mapped reads (RPKM), k-mers lock-free hashes such as that used in Jellyfish10. The index also con- per kilobase per million mapped k-mers (KPKM) and transcripts tains a pair of look-up tables that allow fast access to the k-mers per million (TPM) (see Online Methods for the definitions of these appearing in a specific transcript as well as the transcripts in which a measures). First, Sailfish counts the number of times each indexed particular k-mer appears, both in amortized constant time. The index k-mer occurs in the set of reads. Owing to the index, this process is efficient and scalable (Supplementary Fig. 2). Sailfish then applies an EM procedure to determine maximum likelihood estimates a c 4

) 8 10 Alignment 2 Figure 2 Speed and accuracy of Sailfish. (a) The correlation between qPCR Quantification 6 estimates of gene abundance (x axis) and the estimates of Sailfish. The 4 qPCR results are taken from the microarray quality control study (MAQC)15. 2 18.55 h 0 103 The results shown here are for the human brain tissue, and the RNA-seq– 10.19 h 7.70 h –2 6.03 h 6.96 h based estimates were computed using the reads from SRA accession –4

Sailfish KPKM(log SRX016366 (81,250,481 35bp single-end reads). The set of transcripts –6 2.27 h used in this experiment were the curated RefSeq20 transcripts (accession 0 2 4 2 –8 –6 –4 –2 10 –10 prefix NM) from hg18 (31,148 transcripts). (b) The correlation between qPCR(log2) Time (min) the ground truth FPKM in a simulated data set (x axis) and the abundance b 0.26 h estimates of Sailfish. The quantification in this experiment was performed ) 2 10 1 on a set of 96,520 transcript sequences taken from Ensembl21 GRCh37.73. 10 0.09 h (c) The total time taken by each method, Sailfish, RSEM, eXpress and 5 Cufflinks, to estimate isoform abundance on each data set. The total time 0 taken by a method is the height of the corresponding bar, and the total is 0 10 further broken down into the time taken to perform read-alignment (for

Sailfish KPKM(log –5 press Sailfish, we instead measured the time taken to count the k-mers in the Sailfish RSEM Sailfish RSEM eX Cufflinks eXpressCufflinks –5 0 5 10 read set) and the time taken to quantify abundance given the aligned SRX016366 Synthetic reads (or k-mer counts). All tools were run in multithreaded mode (where True FPKM(log2) applicable) and were allowed to use up to 16 threads. (d) Accuracy of each d of the methods on human brain tissue and a synthetic data set. Accuracy Human brain tissue Synthetic is measured by the Pearson (log-transformed) and Spearman correlation Sailfish RSEM eXpress CufflinksSailfish RSEM eXpress Cufflinks coefficients between estimated abundance values and MAQC qPCR data (for Pearson 0.85 0.82 0.85 0.85 0.96 0.96 0.95 0.94 human brain tissue) or ground truth (for simulated data). Root-mean-square Spearman 0.84 0.80 0.85 0.85 0.76 0.77 0.77 0.75 RMSE – – – – 6.06 8.90 8.83 10.05 error (RMSE) and median percentage error (medPE) are calculated medPE – – – – 6.50 12.48 14.06 12.42 as described in the Online Methods.

NATURE BIOTECHNOLOGY VOLUME 32 NUMBER 5 MAY 2014 463 The Big Data Problem

• There’s an abundance of underutilized sequence data • Alignment can solve a number of important biological questions but doesn’t scale

• Alignment-free methods are more efficient but are not several orders of magnitude more efficient The Big Data Problem

• There’s an abundance of underutilized sequence data • Alignment can solve a number of important biological questions but doesn’t scale

• Alignment-free methods are more efficient but are not several orders of magnitude more efficient

How can we address big data? The Big Data Problem

• There’s an abundance of underutilized sequence data • Alignment can solve a number of important biological questions but doesn’t scale

• Alignment-free methods are more efficient but are not several orders of magnitude more efficient

How can we address big data?

Sketching! Sketching trade accuracy for speed

Given a box, it’s easy to discover what’s inside

!12 Sketching algorithms trade accuracy for speed

Given a box, it’s easy to discover what’s inside But what is there’s too many boxes?

!12 Sketching algorithms trade accuracy for speed

Given a box, it’s easy to discover what’s inside But what is there’s too many boxes?

A sketch solution would organize boxes based on labels, simpler observations such as size or weight, or sub-sample boxes to learn about its neighbors !12 TradeoReview:ffs in Similarity Similarity metrics Metrics • Hamming distance – Count the number of substitutions to transform one string into another MIKESCHATZ ||X||XXXX| MICESHATZZ 5 • Edit distance – The minimum number of substitutions, insertions, or deletions to transform one string into another

MIKESCHAT-Z ||X||X|||X| MICES-HATZZ 3

See Lecture 10: RNA Sequencing TradeoReview:ffs in Similarity Similarity metrics Metrics • Hamming distance – Count the number of substitutions to transform one string into another MIKESCHATZ ||X||XXXX| Much faster to calculate MICESHATZZ 5 • Edit distance – The minimum number of substitutions, insertions, or deletions to transform one string into another

MIKESCHAT-Z ||X||X|||X| MICES-HATZZ 3

See Lecture 10: RNA Sequencing TradeoReview:ffs in Similarity Similarity metrics Metrics • Hamming distance – Count the number of substitutions to transform one string into another MIKESCHATZ ||X||XXXX| Much faster to calculate MICESHATZZ 5 • Edit distance – The minimum number of substitutions, insertions, or deletions to transform one string into another

MIKESCHAT-Z ||X||X|||X| More biologically meaningful MICES-HATZZ 3

See Lecture 10: RNA Sequencing TradeoReview:ffs in Similarity Similarity metrics Metrics • Hamming distance – Count the number of substitutions to transform one string into another MIKESCHATZ ||X||XXXX| Much faster to calculate MICESHATZZ 5 • Edit distance – The minimum number of substitutions, insertions, or deletions to transform one string into another

MIKESCHAT-Z ||X||X|||X| More biologically meaningful MICES-HATZZ 3 Can you think of other tradeoffs in See Lecture 10: RNA Sequencing alignment methods? Solving Big Data through Sketching

Experiment Discovery I have sample X, find me related samples

Containment Query I am interested in transcript X, find me experiments expressing it

Cardinality Estimation

I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies

!14 Solving Big Data through Sketching

Experiment Discovery Mash [Minhash] I have sample X, find me related samples

Containment Query I am interested in transcript X, find me experiments expressing it

Cardinality Estimation

I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies

!14 Solving Big Data through Sketching

Experiment Discovery Mash [Minhash] I have sample X, find me related samples

Containment Query Sequence Bloom Trees [] I am interested in transcript X, find me experiments expressing it

Cardinality Estimation

I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies

!14 Solving Big Data through Sketching

Experiment Discovery Mash [Minhash] I have sample X, find me related samples

Containment Query Sequence Bloom Trees [Bloom Filter] I am interested in transcript X, find me experiments expressing it

Dashing [HyperLogLog] Cardinality Estimation I have N studies, how many unique sequences are present?

Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies

!14 Solving Big Data through Sketching

Experiment Discovery Mash [Minhash] I have sample X, find me related samples

Containment Query Sequence Bloom Trees [Bloom Filter] I am interested in transcript X, find me experiments expressing it

Dashing [HyperLogLog] Cardinality Estimation I have N studies, how many unique sequences are present?

Salmon / Kallisto Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies

!14 Solving Big Data through Sketching

Experiment Discovery Mash [Minhash] I have sample X, find me related samples

Containment Query Sequence Bloom Trees [Bloom Filter] I am interested in transcript X, find me experiments expressing it

Dashing [HyperLogLog] Cardinality Estimation I have N studies, how many unique sequences are present?

Salmon / Kallisto Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies

Mantis [Counting ]

!14 © 2015 Nature America, Inc. All rights reserved. computed to determine the overlap offset (0) for is positions their in difference median the and sequences original the in located are case) this in CCG and (ACC min-mers shared the sketches, accurate estimates. ( the error bound controlled by S ARTICLES 624 fraction of entries shared between the sketches of two sequences share the same minimum fingerprints (underlined) for than the set of all of the ordered set of its the min-mer for that hash. ( The generated using an XORShift pseudo-random number generator ( ( H of hash functions determines the resulting sketch size number The functions. hash multiple by fingerprints integer to converted resulting in 12 the sequence into its constituent ( Figure 1 a For sketches. using without reads two between similarity Jaccard the measures exactly that approach a direct with MHAP overlapping 2 Figure sensitivity. 80% over achieve to required is 16-mers ~150 of only a sketch rate, error a 15% with genome human the to reads kbp 10 mapping For overlapping. than easier is genome the in reference a to additive reads high-error mapping roughly reads, two is the of rate error alignment an of rate error the because ( estimate Jaccard the of error expected the reduces which size, sketch the increasing by improved 1 Notes rate of 30%, so MHAP uses 10 kbp reads simulated from the human genome with an overlap error of value largest the use to preferable is it so found, matches false of number ( input overlap maintaining accuracy reads, acceptable while detection the than smaller magnitude of order an sketches use to possible Itis filter. the of sensitivity small. the reduces sketches min-mers keep fewer to using However, preferable is it so size, sketch the to tional storeindex,requiredcomparehash,andtimeto The filtering. alignment efficient for sketches MinHash uses MHAP MinHash alignment filtering RESULTS similarity. estimating for technique efficient putationally 1 Fig. shared of number the with correlated strongly is estimate resulting The sketches. their to be by estimated simply computing the Hamming between distance two of similarity Jaccard the allows hashing ity-sensitive a sequence makes the sketch ( for min-mers of collection The retained. is min-mer, or fingerprint, valued minimum the only function, hash each For functions. hash glesor for a DNA one must sequence, convert all of minimizers as a generalization viewed larity similarity document to applied a Fig. Fig. 2a 2 = 4, four independent hash sets are generated for each sequence sequence each for generated are sets hash independent four = 4, ) To create a MinHash sketch of a DNA sequence 1… (0.5) serves as an estimate of their true Jaccard similarity (0.22), with with (0.22), similarity Jaccard true their of estimate an as serves (0.5) The efficiency of MHAP improves with increased read length. length. read increased with improves MHAP of efficiency The from overlaps kbp 2 detect effectively can 16-mers Specifically,

k H -mer generating the minimum value for each hash is referred to as to referred is hash each for value minimum the generating -mer 23– ). In MHAP, after the initial hash ( ). Because the sketches are comparatively small, this is a com a is this small, comparatively are sketches the Because ). q

, 2 b k -grams) to integer fingerprints using multiple, randomized randomized multiple, using fingerprints integer to -grams) Rapid overlapping of noisy reads using MinHash sketches. sketches. MinHash using reads noisy of overlapping Rapid 5 c and ). For human, using a small value of that maintains sensitivity. maintains that and metagenomic clustering compares the total number of 2 k -mers each for and Online Methods). Sensitivity can be further further be can Sensitivity Methods). Online and k -mers. In this example, the sketches of k- e ) If sufficient similarity is detected between two two between detected is similarity sufficient ) If mers between two sequences ( sequences two between mers H min-mer fingerprints, which is much smaller smaller much is which fingerprints, min-mer c ) The sketch of a sequence is composed composed is a sequence of sketch ) The k H = 16 by default ( default by 16 = . In practice, S Fig. Fig. 1 Supplementary Fig. 1 Fig. Supplementary k 2 and and -mers. In the example shown, 0 , image similarity , image 1 Berlin et al (2015) Assembling large genomes with single-molecule sequencing andlocality-sensitive hashing and Online Methods). This local This Methods). Online and S 2 . ( 1 ), subsequent fingerprints are are fingerprints subsequent ), 2 b 6 k 2 H ) All . The approach can also be be also can approach The . 7 -mers (also known as, shin . Briefly, to create . a to Briefly, create sketch >> 4 is required to obtain >> obtain to required 4 is k Fig. Fig. 2 S k -mers counted during during counted -mers k 1 (e.g., 10) increases the -mers are then then are -mers S and and , we first decomposed decomposed first , we ( in the be estimatedbytheoverlap 4) TheJaccard similaritycan each hashfunctionis chosen 3) Thesmallestvalues for Nature Biotechnology ( 2) Multiplehashfunctions into 1) Sequencedecomposed Minhash) b H k 2 Γ 1 2 , -mers is propor is -mers . Here, where where . Here, S Supplementary Supplementary , sequence simi and and )mapkmerstovalues. Supplementary Supplementary 2 ). Additionally, Additionally, ). . S kmers 1 Min and and k 2 -mer sets sets -mer . ( k S = 3, 2… S d imum The MinhashSketch 1 ) The ) The 2 and and

H

). ).

- - - - -

Hash trols how many alignments are reported for each read. The HGAP sensitivityis primarily affected by the balanced that ( accuracy with chosen speed were sensitive) and (fast settings parameter ( sweep parameter the on Based threshold. similarity Jaccard the and Methods). Online and magnitude faster than BLASR at all levels of sensitivity sensitivity of levels all at BLASR than ( faster magnitude of order an and it tested genomes all alignments; across pairs accurate possible consistently was all all considers overlapping MHAP for contrast, suited In reads. ideally mapping of not for is and designed reference a originally to was BLASR BWA-MEM, Like assembler sets sets assembler ( 2 Tables chemistries Supplementary sequencing and settings parameter multiple using evaluated were tools the and genomes, reference to reads ping comparing by map from inferred were overlaps,which true overlapsto evaluated detected was DALIGNER and BLASR MHAP, of overlaps noisy all between pairs of reads ( detect reliably not did algorithms these of versions BWA-MEM BLASR reads, SMRT for designed tools other overlap of MHAP two per.sensitive versus and Wespecificity sensitivity the evaluated highly a also is MHAP fast, being to addition In MHAP overlapping performance technologies. sequencing long-read future of accuracy and length read increasing the with improve to expected is 1 Note for read length; of increasing decreases reads (which plexity is governed only by the sketch size (a constant) and the number com the because length, read increasing with rapidly MHAP decays by performed comparisons min-mer of number relative the length, overlap 20% a and minimum sequenced, bases of total number fixed long-read overlaps. Although developed for the Dazzler assembler, assembler, Dazzler the for developed Although overlaps. long-read ( coverage replicon uneven and complexity sequence by affected and BLASR genome-dependent genomes. highly was sensitivity and repetitive runtime for overlaps missed in result can this Supplementary Table 2 Supplementary Supplementary Figs. 2 Figs. Supplementary MHAP sensitivity is tunable based on the size of Like MHAP, DALIGNER utilizes efficient VOLUME 33 VOLUME es and and e d c b a Supplementary Table 1 Supplementary 3 0 , SNAP 18 20 33 24 22 33 40 58 14 19 [

54 5 S bestn 1 5 1 , NUMBER 6 Sketch : 13 28 48 28 23 37 57 14

1 84 36 75 1 2 ,

equal to the depth of sequencing coverage, but but coverage, sequencing of depth the to equal 3

1 2 Table Table 56 11 60 47 11 16 36 57 ( and RazerS and , S 24 72 04 2 3 1 )

and ) and empirical assembly tests, two MHAP two tests, assembly empirical and ) 15 Supplementary Note 3 Supplementary S S and 39 54 43 26 54 15 19 36 61 ] 2 1

1 6 5 4 J : : 1 3 ( JUNE 2015 S and 3 ). 1 , min-mers and S 2 ) 3 Supplementary Note 2 Note Supplementary ≈ 2 ). Thus, the efficiency of MHAP of efficiency the Thus, ).

were also evaluated, but current were evaluated, also 2/4 =0.5 Supplementary Figs. 2 2 Figs. Supplementary bestn

NATURE NATURE BIOTECHNOLOGY 27 54 13 35 24 22 49 44 11 18 36 k 5 [ 1 parameter, which con 5 -mer matching to detect , Sketch 2 33 56 30 48 44 27 54 13 19 8

64 75 1 1 2 and DALIGNER

, ). ). The performance

6 28 39 60 47 27 33 56 14 ( k S 3 , 94 04 95 6 Supplementary Supplementary , the sketch size size sketch , the 2 : )

S 6 2 11 18 43 26 49 28 39 57 ] 4 4 2 5 6 ). BLASR BLASR ). Table Table Table Table and 1 2 3 1 1 9 ). ). 5 - - - - , .

© 2015 Nature America, Inc. All rights reserved. computed to determine the overlap offset (0) for is positions their in difference median the and sequences original the in located are case) this in CCG and (ACC min-mers shared the sketches, accurate estimates. ( the error bound controlled by S ARTICLES 624 fraction of entries shared between the sketches of two sequences share the same minimum fingerprints (underlined) for than the set of all of the ordered set of its the min-mer for that hash. ( The generated using an XORShift pseudo-random number generator ( ( H of hash functions determines the resulting sketch size number The functions. hash multiple by fingerprints integer to converted resulting in 12 the sequence into its constituent ( Figure 1 a For sketches. using without reads two between similarity Jaccard the measures exactly that approach a direct with MHAP overlapping 2 Figure sensitivity. 80% over achieve to required is 16-mers ~150 of only a sketch rate, error a 15% with genome human the to reads kbp 10 mapping For overlapping. than easier is genome the in reference a to additive reads high-error mapping roughly reads, two is the of rate error alignment an of rate error the because ( estimate Jaccard the of error expected the reduces which size, sketch the increasing by improved 1 Notes rate of 30%, so MHAP uses 10 kbp reads simulated from the human genome with an overlap error of value largest the use to preferable is it so found, matches false of number ( input overlap maintaining accuracy reads, acceptable while detection the than smaller magnitude of order an sketches use to possible Itis filter. the of sensitivity small. the reduces sketches min-mers keep fewer to using However, preferable is it so size, sketch the to tional storeindex,requiredcomparehash,andtimeto The filtering. alignment efficient for sketches MinHash uses MHAP MinHash alignment filtering RESULTS similarity. estimating for technique efficient putationally 1 Fig. shared of number the with correlated strongly is estimate resulting The sketches. their to be by estimated simply computing the Hamming between distance two of similarity Jaccard the allows hashing ity-sensitive a sequence makes the sketch ( for min-mers of collection The retained. is min-mer, or fingerprint, valued minimum the only function, hash each For functions. hash glesor for a DNA one must sequence, convert all of minimizers as a generalization viewed larity similarity document to applied a Fig. Fig. 2a 2 = 4, four independent hash sets are generated for each sequence sequence each for generated are sets hash independent four = 4, ) To create a MinHash sketch of a DNA sequence 1… (0.5) serves as an estimate of their true Jaccard similarity (0.22), with with (0.22), similarity Jaccard true their of estimate an as serves (0.5) The efficiency of MHAP improves with increased read length. length. read increased with improves MHAP of efficiency The from overlaps kbp 2 detect effectively can 16-mers Specifically,

k H -mer generating the minimum value for each hash is referred to as to referred is hash each for value minimum the generating -mer 23– ). In MHAP, after the initial hash ( ). Because the sketches are comparatively small, this is a com a is this small, comparatively are sketches the Because ). q

, 2 b k -grams) to integer fingerprints using multiple, randomized randomized multiple, using fingerprints integer to -grams) Rapid overlapping of noisy reads using MinHash sketches. sketches. MinHash using reads noisy of overlapping Rapid 5 c and ). For human, using a small value of that maintains sensitivity. maintains that and metagenomic clustering compares the total number of 2 k -mers each for and Online Methods). Sensitivity can be further further be can Sensitivity Methods). Online and k -mers. In this example, the sketches of k- e ) If sufficient similarity is detected between two two between detected is similarity sufficient ) If mers between two sequences ( sequences two between mers H min-mer fingerprints, which is much smaller smaller much is which fingerprints, min-mer c ) The sketch of a sequence is composed composed is a sequence of sketch ) The k H = 16 by default ( default by 16 = . In practice, S Fig. Fig. 1 Supplementary Fig. 1 Fig. Supplementary k 2 and and -mers. In the example shown, 0 , image similarity , image 1 Berlin et al (2015) Assembling large genomes with single-molecule sequencing andlocality-sensitive hashing and Online Methods). This local This Methods). Online and S 2 . ( 1 ), subsequent fingerprints are are fingerprints subsequent ), 2 b 6 k 2 H ) All . The approach can also be be also can approach The . 7 -mers (also known as, shin . Briefly, to create . a to Briefly, create sketch >> 4 is required to obtain >> obtain to required 4 is k Fig. Fig. 2 S k -mers counted during during counted -mers k 1 (e.g., 10) increases the -mers are then then are -mers S and and , we first decomposed decomposed first , we ( in the be estimatedbytheoverlap 4) TheJaccard similaritycan each hashfunctionis chosen 3) Thesmallestvalues for Nature Biotechnology ( 2) Multiplehashfunctions into 1) Sequencedecomposed Minhash) b H k 2 Γ 1 2 , -mers is propor is -mers . Here, where where . Here, S Supplementary Supplementary , sequence simi and and )mapkmerstovalues. Supplementary Supplementary 2 ). Additionally, Additionally, ). . S kmers 1 Min and and k 2 -mer sets sets -mer . ( k S = 3, 2… S d imum The MinhashSketch 1 ) The ) The 2 and and

H

). ).

- - - - -

Hash trols how many alignments are reported for each read. The HGAP sensitivityis primarily affected by the balanced that ( accuracy with chosen speed were sensitive) and (fast settings parameter ( sweep parameter the on Based threshold. similarity Jaccard the and Methods). Online and magnitude faster than BLASR at all levels of sensitivity sensitivity of levels all at BLASR than ( faster magnitude of order an and it tested genomes all alignments; across pairs accurate possible consistently was all all considers overlapping MHAP for contrast, suited In reads. ideally mapping of not for is and designed reference a originally to was BLASR BWA-MEM, Like assembler sets sets assembler ( 2 Tables chemistries Supplementary sequencing and settings parameter multiple using evaluated were tools the and genomes, reference to reads ping comparing by map from inferred were overlaps,which true overlapsto evaluated detected was DALIGNER and BLASR MHAP, of overlaps noisy all between pairs of reads ( detect reliably not did algorithms these of versions BWA-MEM BLASR reads, SMRT for designed tools other overlap of MHAP two per.sensitive versus and Wespecificity sensitivity the evaluated highly a also is MHAP fast, being to addition In MHAP overlapping performance technologies. sequencing long-read future of accuracy and length read increasing the with improve to expected is 1 Note for read length; of increasing decreases reads (which plexity is governed only by the sketch size (a constant) and the number com the because length, read increasing with rapidly MHAP decays by performed comparisons min-mer of number relative the length, overlap 20% a and minimum sequenced, bases of total number fixed long-read overlaps. Although developed for the Dazzler assembler, assembler, Dazzler the for developed Although overlaps. long-read ( coverage replicon uneven and complexity sequence by affected and BLASR genome-dependent genomes. highly was sensitivity and repetitive runtime for overlaps missed in result can this Supplementary Table 2 Supplementary Supplementary Figs. 2 Figs. Supplementary MHAP sensitivity is tunable based on the size of Like MHAP, DALIGNER utilizes efficient VOLUME 33 VOLUME es and and e d c b a Supplementary Table 1 Supplementary 3 0 , SNAP 18 20 33 24 22 33 40 58 14 19 [

54 5 S bestn 1 5 1 , NUMBER 6 Sketch : 13 28 48 28 23 37 57 14

1 84 36 75 1 2 ,

equal to the depth of sequencing coverage, but but coverage, sequencing of depth the to equal 3

1 2 Table Table 56 11 60 47 11 16 36 57 ( and RazerS and , S 24 72 04 2 3 1 )

and ) and empirical assembly tests, two MHAP two tests, assembly empirical and ) 15 Supplementary Note 3 Supplementary S S and 39 54 43 26 54 15 19 36 61 ] 2 1

1 6 5 4 J : : 1 3 ( JUNE 2015 S and 3 ). 1 , min-mers and S 2 ) 3 Supplementary Note 2 Note Supplementary ≈ 2 ). Thus, the efficiency of MHAP of efficiency the Thus, ).

were also evaluated, but current were evaluated, also 2/4 =0.5 Supplementary Figs. 2 2 Figs. Supplementary bestn

NATURE NATURE BIOTECHNOLOGY 27 54 13 35 24 22 49 44 11 18 36 k 5 [ 1 parameter, which con 5 -mer matching to detect , Sketch 2 33 56 30 48 44 27 54 13 19 8

64 75 1 1 2 and DALIGNER

, ). ). The performance

6 28 39 60 47 27 33 56 14 ( k S 3 , 94 04 95 6 Supplementary Supplementary , the sketch size size sketch , the 2 : )

S 6 2 11 18 43 26 49 28 39 57 ] 4 4 2 5 6 ). BLASR BLASR ). Table Table Table Table and 1 2 3 1 1 9 ). ). 5 - - - - , .

© 2015 Nature America, Inc. All rights reserved. computed to determine the overlap offset (0) for is positions their in difference median the and sequences original the in located are case) this in CCG and (ACC min-mers shared the sketches, accurate estimates. ( the error bound controlled by S ARTICLES 624 fraction of entries shared between the sketches of two sequences share the same minimum fingerprints (underlined) for than the set of all of the ordered set of its the min-mer for that hash. ( The generated using an XORShift pseudo-random number generator ( ( H of hash functions determines the resulting sketch size number The functions. hash multiple by fingerprints integer to converted resulting in 12 the sequence into its constituent ( Figure 1 a For sketches. using without reads two between similarity Jaccard the measures exactly that approach a direct with MHAP overlapping 2 Figure sensitivity. 80% over achieve to required is 16-mers ~150 of only a sketch rate, error a 15% with genome human the to reads kbp 10 mapping For overlapping. than easier is genome the in reference a to additive reads high-error mapping roughly reads, two is the of rate error alignment an of rate error the because ( estimate Jaccard the of error expected the reduces which size, sketch the increasing by improved 1 Notes rate of 30%, so MHAP uses 10 kbp reads simulated from the human genome with an overlap error of value largest the use to preferable is it so found, matches false of number ( input overlap maintaining accuracy reads, acceptable while detection the than smaller magnitude of order an sketches use to possible Itis filter. the of sensitivity small. the reduces sketches min-mers keep fewer to using However, preferable is it so size, sketch the to tional storeindex,requiredcomparehash,andtimeto The filtering. alignment efficient for sketches MinHash uses MHAP MinHash alignment filtering RESULTS similarity. estimating for technique efficient putationally 1 Fig. shared of number the with correlated strongly is estimate resulting The sketches. their to be by estimated simply computing the Hamming between distance two of similarity Jaccard the allows hashing ity-sensitive a sequence makes the sketch ( for min-mers of collection The retained. is min-mer, or fingerprint, valued minimum the only function, hash each For functions. hash glesor for a DNA one must sequence, convert all of minimizers as a generalization viewed larity similarity document to applied a Fig. Fig. 2a 2 = 4, four independent hash sets are generated for each sequence sequence each for generated are sets hash independent four = 4, ) To create a MinHash sketch of a DNA sequence 1… (0.5) serves as an estimate of their true Jaccard similarity (0.22), with with (0.22), similarity Jaccard true their of estimate an as serves (0.5) The efficiency of MHAP improves with increased read length. length. read increased with improves MHAP of efficiency The from overlaps kbp 2 detect effectively can 16-mers Specifically,

k H -mer generating the minimum value for each hash is referred to as to referred is hash each for value minimum the generating -mer 23– ). In MHAP, after the initial hash ( ). Because the sketches are comparatively small, this is a com a is this small, comparatively are sketches the Because ). q

, 2 b k -grams) to integer fingerprints using multiple, randomized randomized multiple, using fingerprints integer to -grams) Rapid overlapping of noisy reads using MinHash sketches. sketches. MinHash using reads noisy of overlapping Rapid 5 c and ). For human, using a small value of that maintains sensitivity. maintains that and metagenomic clustering compares the total number of 2 k -mers each for and Online Methods). Sensitivity can be further further be can Sensitivity Methods). Online and k -mers. In this example, the sketches of k- e ) If sufficient similarity is detected between two two between detected is similarity sufficient ) If mers between two sequences ( sequences two between mers H min-mer fingerprints, which is much smaller smaller much is which fingerprints, min-mer c ) The sketch of a sequence is composed composed is a sequence of sketch ) The k H = 16 by default ( default by 16 = . In practice, S Fig. Fig. 1 Supplementary Fig. 1 Fig. Supplementary k 2 and and -mers. In the example shown, 0 , image similarity , image 1 Berlin et al (2015) Assembling large genomes with single-molecule sequencing andlocality-sensitive hashing and Online Methods). This local This Methods). Online and S 2 . ( 1 ), subsequent fingerprints are are fingerprints subsequent ), 2 b 6 k 2 H ) All . The approach can also be be also can approach The . 7 -mers (also known as, shin . Briefly, to create . a to Briefly, create sketch >> 4 is required to obtain >> obtain to required 4 is k Fig. Fig. 2 S k -mers counted during during counted -mers k 1 (e.g., 10) increases the -mers are then then are -mers S and and , we first decomposed decomposed first , we ( in the be estimatedbytheoverlap 4) TheJaccard similaritycan each hashfunctionis chosen 3) Thesmallestvalues for Nature Biotechnology ( 2) Multiplehashfunctions into 1) Sequencedecomposed Minhash) b H k 2 Γ 1 2 , -mers is propor is -mers . Here, where where . Here, S Supplementary Supplementary , sequence simi and and )mapkmerstovalues. Supplementary Supplementary 2 ). Additionally, Additionally, ). . S kmers 1 Min and and k 2 -mer sets sets -mer . ( k S = 3, 2… S d imum The MinhashSketch 1 ) The ) The 2 and and

H

). ).

- - - - -

Hash trols how many alignments are reported for each read. The HGAP sensitivityis primarily affected by the balanced that ( accuracy with chosen speed were sensitive) and (fast settings parameter ( sweep parameter the on Based threshold. similarity Jaccard the and Methods). Online and magnitude faster than BLASR at all levels of sensitivity sensitivity of levels all at BLASR than ( faster magnitude of order an and it tested genomes all alignments; across pairs accurate possible consistently was all all considers overlapping MHAP for contrast, suited In reads. ideally mapping of not for is and designed reference a originally to was BLASR BWA-MEM, Like assembler sets sets assembler ( 2 Tables chemistries Supplementary sequencing and settings parameter multiple using evaluated were tools the and genomes, reference to reads ping comparing by map from inferred were overlaps,which true overlapsto evaluated detected was DALIGNER and BLASR MHAP, of overlaps noisy all between pairs of reads ( detect reliably not did algorithms these of versions BWA-MEM BLASR reads, SMRT for designed tools other overlap of MHAP two per.sensitive versus and Wespecificity sensitivity the evaluated highly a also is MHAP fast, being to addition In MHAP overlapping performance technologies. sequencing long-read future of accuracy and length read increasing the with improve to expected is 1 Note for read length; of increasing decreases reads (which plexity is governed only by the sketch size (a constant) and the number com the because length, read increasing with rapidly MHAP decays by performed comparisons min-mer of number relative the length, overlap 20% a and minimum sequenced, bases of total number fixed long-read overlaps. Although developed for the Dazzler assembler, assembler, Dazzler the for developed Although overlaps. long-read ( coverage replicon uneven and complexity sequence by affected and BLASR genome-dependent genomes. highly was sensitivity and repetitive runtime for overlaps missed in result can this Supplementary Table 2 Supplementary Supplementary Figs. 2 Figs. Supplementary MHAP sensitivity is tunable based on the size of Like MHAP, DALIGNER utilizes efficient VOLUME 33 VOLUME es and and e d c b a Supplementary Table 1 Supplementary 3 0 , SNAP 18 20 33 24 22 33 40 58 14 19 [

54 5 S bestn 1 5 1 , NUMBER 6 Sketch : 13 28 48 28 23 37 57 14

1 84 36 75 1 2 ,

equal to the depth of sequencing coverage, but but coverage, sequencing of depth the to equal 3

1 2 Table Table 56 11 60 47 11 16 36 57 ( and RazerS and , S 24 72 04 2 3 1 )

and ) and empirical assembly tests, two MHAP two tests, assembly empirical and ) 15 Supplementary Note 3 Supplementary S S and 39 54 43 26 54 15 19 36 61 ] 2 1

1 6 5 4 J : : 1 3 ( JUNE 2015 S and 3 ). 1 , min-mers and S 2 ) 3 Supplementary Note 2 Note Supplementary ≈ 2 ). Thus, the efficiency of MHAP of efficiency the Thus, ).

were also evaluated, but current were evaluated, also 2/4 =0.5 Supplementary Figs. 2 2 Figs. Supplementary bestn

NATURE NATURE BIOTECHNOLOGY 27 54 13 35 24 22 49 44 11 18 36 k 5 [ 1 parameter, which con 5 -mer matching to detect , Sketch 2 33 56 30 48 44 27 54 13 19 8

64 75 1 1 2 and DALIGNER

, ). ). The performance

6 28 39 60 47 27 33 56 14 ( k S 3 , 94 04 95 6 Supplementary Supplementary , the sketch size size sketch , the 2 : )

S 6 2 11 18 43 26 49 28 39 57 ] 4 4 2 5 6 ). BLASR BLASR ). Table Table Table Table and 1 2 3 1 1 9 ). ). 5 - - - - , .

© 2015 Nature America, Inc. All rights reserved. computed to determine the overlap offset (0) for is positions their in difference median the and sequences original the in located are case) this in CCG and (ACC min-mers shared the sketches, accurate estimates. ( the error bound controlled by S ARTICLES 624 fraction of entries shared between the sketches of two sequences share the same minimum fingerprints (underlined) for than the set of all of the ordered set of its the min-mer for that hash. ( The generated using an XORShift pseudo-random number generator ( ( H of hash functions determines the resulting sketch size number The functions. hash multiple by fingerprints integer to converted resulting in 12 the sequence into its constituent ( Figure 1 a For sketches. using without reads two between similarity Jaccard the measures exactly that approach a direct with MHAP overlapping 2 Figure sensitivity. 80% over achieve to required is 16-mers ~150 of only a sketch rate, error a 15% with genome human the to reads kbp 10 mapping For overlapping. than easier is genome the in reference a to additive reads high-error mapping roughly reads, two is the of rate error alignment an of rate error the because ( estimate Jaccard the of error expected the reduces which size, sketch the increasing by improved 1 Notes rate of 30%, so MHAP uses 10 kbp reads simulated from the human genome with an overlap error of value largest the use to preferable is it so found, matches false of number ( input overlap maintaining accuracy reads, acceptable while detection the than smaller magnitude of order an sketches use to possible Itis filter. the of sensitivity small. the reduces sketches min-mers keep fewer to using However, preferable is it so size, sketch the to tional storeindex,requiredcomparehash,andtimeto The filtering. alignment efficient for sketches MinHash uses MHAP MinHash alignment filtering RESULTS similarity. estimating for technique efficient putationally 1 Fig. shared of number the with correlated strongly is estimate resulting The sketches. their to be by estimated simply computing the Hamming between distance two of similarity Jaccard the allows hashing ity-sensitive a sequence makes the sketch ( for min-mers of collection The retained. is min-mer, or fingerprint, valued minimum the only function, hash each For functions. hash glesor for a DNA one must sequence, convert all of minimizers as a generalization viewed larity similarity document to applied a Fig. Fig. 2a 2 = 4, four independent hash sets are generated for each sequence sequence each for generated are sets hash independent four = 4, ) To create a MinHash sketch of a DNA sequence 1… (0.5) serves as an estimate of their true Jaccard similarity (0.22), with with (0.22), similarity Jaccard true their of estimate an as serves (0.5) The efficiency of MHAP improves with increased read length. length. read increased with improves MHAP of efficiency The from overlaps kbp 2 detect effectively can 16-mers Specifically,

k H -mer generating the minimum value for each hash is referred to as to referred is hash each for value minimum the generating -mer 23– ). In MHAP, after the initial hash ( ). Because the sketches are comparatively small, this is a com a is this small, comparatively are sketches the Because ). q

, 2 b k -grams) to integer fingerprints using multiple, randomized randomized multiple, using fingerprints integer to -grams) Rapid overlapping of noisy reads using MinHash sketches. sketches. MinHash using reads noisy of overlapping Rapid 5 c and ). For human, using a small value of that maintains sensitivity. maintains that and metagenomic clustering compares the total number of 2 k -mers each for and Online Methods). Sensitivity can be further further be can Sensitivity Methods). Online and k -mers. In this example, the sketches of k- e ) If sufficient similarity is detected between two two between detected is similarity sufficient ) If mers between two sequences ( sequences two between mers H min-mer fingerprints, which is much smaller smaller much is which fingerprints, min-mer c ) The sketch of a sequence is composed composed is a sequence of sketch ) The k H = 16 by default ( default by 16 = . In practice, S Fig. Fig. 1 Supplementary Fig. 1 Fig. Supplementary k 2 and and -mers. In the example shown, 0 , image similarity , image 1 Berlin et al (2015) Assembling large genomes with single-molecule sequencing andlocality-sensitive hashing and Online Methods). This local This Methods). Online and S 2 . ( 1 ), subsequent fingerprints are are fingerprints subsequent ), 2 b 6 k 2 H ) All . The approach can also be be also can approach The . 7 -mers (also known as, shin . Briefly, to create . a to Briefly, create sketch >> 4 is required to obtain >> obtain to required 4 is k Fig. Fig. 2 S k -mers counted during during counted -mers k 1 (e.g., 10) increases the -mers are then then are -mers S and and , we first decomposed decomposed first , we ( in the be estimatedbytheoverlap 4) TheJaccard similaritycan each hashfunctionis chosen 3) Thesmallestvalues for Nature Biotechnology ( 2) Multiplehashfunctions into 1) Sequencedecomposed Minhash) b H k 2 Γ 1 2 , -mers is propor is -mers . Here, where where . Here, S Supplementary Supplementary , sequence simi and and )mapkmerstovalues. Supplementary Supplementary 2 ). Additionally, Additionally, ). . S kmers 1 Min and and k 2 -mer sets sets -mer . ( k S = 3, 2… S d imum The MinhashSketch 1 ) The ) The 2 and and

H

). ).

- - - - -

Hash trols how many alignments are reported for each read. The HGAP sensitivityis primarily affected by the balanced that ( accuracy with chosen speed were sensitive) and (fast settings parameter ( sweep parameter the on Based threshold. similarity Jaccard the and Methods). Online and magnitude faster than BLASR at all levels of sensitivity sensitivity of levels all at BLASR than ( faster magnitude of order an and it tested genomes all alignments; across pairs accurate possible consistently was all all considers overlapping MHAP for contrast, suited In reads. ideally mapping of not for is and designed reference a originally to was BLASR BWA-MEM, Like assembler sets sets assembler ( 2 Tables chemistries Supplementary sequencing and settings parameter multiple using evaluated were tools the and genomes, reference to reads ping comparing by map from inferred were overlaps,which true overlapsto evaluated detected was DALIGNER and BLASR MHAP, of overlaps noisy all between pairs of reads ( detect reliably not did algorithms these of versions BWA-MEM BLASR reads, SMRT for designed tools other overlap of MHAP two per.sensitive versus and Wespecificity sensitivity the evaluated highly a also is MHAP fast, being to addition In MHAP overlapping performance technologies. sequencing long-read future of accuracy and length read increasing the with improve to expected is 1 Note for read length; of increasing decreases reads (which plexity is governed only by the sketch size (a constant) and the number com the because length, read increasing with rapidly MHAP decays by performed comparisons min-mer of number relative the length, overlap 20% a and minimum sequenced, bases of total number fixed long-read overlaps. Although developed for the Dazzler assembler, assembler, Dazzler the for developed Although overlaps. long-read ( coverage replicon uneven and complexity sequence by affected and BLASR genome-dependent genomes. highly was sensitivity and repetitive runtime for overlaps missed in result can this Supplementary Table 2 Supplementary Supplementary Figs. 2 Figs. Supplementary MHAP sensitivity is tunable based on the size of Like MHAP, DALIGNER utilizes efficient VOLUME 33 VOLUME es and and e d c b a Supplementary Table 1 Supplementary 3 0 , SNAP 18 20 33 24 22 33 40 58 14 19 [

54 5 S bestn 1 5 1 , NUMBER 6 Sketch : 13 28 48 28 23 37 57 14

1 84 36 75 1 2 ,

equal to the depth of sequencing coverage, but but coverage, sequencing of depth the to equal 3

1 2 Table Table 56 11 60 47 11 16 36 57 ( and RazerS and , S 24 72 04 2 3 1 )

and ) and empirical assembly tests, two MHAP two tests, assembly empirical and ) 15 Supplementary Note 3 Supplementary S S and 39 54 43 26 54 15 19 36 61 ] 2 1

1 6 5 4 J : : 1 3 ( JUNE 2015 S and 3 ). 1 , min-mers and S 2 ) 3 Supplementary Note 2 Note Supplementary ≈ 2 ). Thus, the efficiency of MHAP of efficiency the Thus, ).

were also evaluated, but current were evaluated, also 2/4 =0.5 Supplementary Figs. 2 2 Figs. Supplementary bestn

NATURE NATURE BIOTECHNOLOGY 27 54 13 35 24 22 49 44 11 18 36 k 5 [ 1 parameter, which con 5 -mer matching to detect , Sketch 2 33 56 30 48 44 27 54 13 19 8

64 75 1 1 2 and DALIGNER

, ). ). The performance

6 28 39 60 47 27 33 56 14 ( k S 3 , 94 04 95 6 Supplementary Supplementary , the sketch size size sketch , the 2 : )

S 6 2 11 18 43 26 49 28 39 57 ] 4 4 2 5 6 ). BLASR BLASR ). Table Table Table Table and 1 2 3 1 1 9 ). ). 5 - - - - , .

MHAP uses Minhash to

approximateARTICLES read overlap

Figure 4 Continuity and putative GRCh38 gap closures of the CHM1 assembly. Human chromosomes are painted with assembled CHM1 contigs using the colored Chromosomes package59. Alternating shades indicate adjacent contigs, so each vertical transition from gray to black represents a contig boundary or alignment breakpoint. The left half of each chromosome shows the MHAP assembly of the SMRT data set and the right half shows the Illumina-based assembly42. The SMRT assembly is considerably more continuous, with an average of less than 100 contigs per chromosome. Putative GRCh38 gap closures are shown as red dashes next to the spanned gap position. Most SMRT closures fall near the telomeres and centromeres. The three gaps spanned by the Illumina assembly are not associated with the primary chromosomes and cannot be displayed. Tiling gaps, in white, coincide with missing assembly sequence and/or regions of uncharacterized reference sequence (i.e., long stretches of N’s) to which no contigs could be mapped. A full list of putative gap closures is given in Supplementary Table 6.

in a single alignment to a single contig (vs. 16,189 for BLASR). Of these, 16,751 were reconstructed at over 99% identity and 14,824 were reconstructed with perfect identity (vs. 12,591 for BLASR). Because repeats represent the greatest challenge to assembly, we 1 2345678910 11 12 also analyzed the completeness of D. melanogaster transposable element (TE) families. TEs in D. melanogaster represent a large fraction of annotated repeats and vary over a wide range of sizes and levels of sequence diversity49. To assess TE resolution in our assem- bly, we replicated the analysis of a recent study50 that assembled D. melanogaster from Illumina Synthetic Long Reads (Moleculo) using Celera Assembler. The SMRT assembly is hundreds of fold more continuous than the Moleculo assembly (NG50 21 Mbp vs. 70 kbp), partially owing to higher sequencing coverage (90× vs. 34×) but primarily because of longer read length and reduced sequencing bias. Of 5,425 annotated TE elements in the euchromatic arms49, 5,274

(97%) are contained in a single contig by the MHAP assembly, and 13 14 15 16 17 18 19 20 21 22 X the majority (4,984) aligned perfectly to the reference. For the highly abundant roo TE family, 97% (134 of 138) of copies were resolved, comprising at least 50% of the terminal telomeric repeat on both the !16 with 93 at 100% identity. This is in contrast to the results of McCoy left and right ends. This indicates that a majority of the 16 chromo- 50

Nature America, Inc. All rights reserved. Inc. Nature America, et al. , where only 5.2% of roo copies were resolved using Moleculo. somes were completely resolved from telomere to telomere. In the

5 For the juan family, with less than 0.01% divergence between copies, remaining cases, four chromosomes were composed of more than all 11 copies were reconstructed at perfect identity by the SMRT one contig containing the telomeres, and two chromosomes did not

© 201 assembly. We conclude that the high error rate of SMRT sequencing extend into the telomeres (or the telomeric sequence did not match does not prohibit the accurate reconstruction of TE sequences, the reference). This is a substantial improvement compared with the even from highly abundant TE families with many identical copies current S. cerevisiase W303 genome36, in which only one chromo- interspersed throughout the genome. some is spanned from end to end and only five chromosome ends have been annotated. Improved telomere assemblies In contrast to the simple telomeric repeats of S. cerevisiae and other Because SMRT sequencing generates long reads without the need for eukaryotes, D. melanogaster telomeres are composed of head-to-tail cloning or amplification, it is possible to better reconstruct the repeat- arrays of three specialized retrotransposable elements (Het-A, TART rich heterochromatic regions of eukaryotic chromosomes. This is a and Tahre) and clusters of telomere-associated sequences (TASs)53. distinct advantage compared to previous sequencing methods, for Because telomeric TEs preferentially transpose to chromosome ends, which heterochromatin sequencing was thought to be impossible they are virtually absent from the euchromatic regions53. By mapping because of cloning biases or short read lengths. As proof of principle, repeat families from RepBase54 and a recent D. melanogaster repeat we evaluated the ability of long-read sequencing to reconstruct het- study55, we identified repeat arrays characteristic of D. melanogaster erochromatic sequences in the telomeric regions of S. cerevisiae51 and telomeres (Supplementary Note 8). A total of 24 telomeric contigs are D. melanogaster38 (Supplementary Note 8). Telomeres play important present in the MHAP assembly. One contig, corresponding to chro- roles in chromosome replication in all eukaryotic genomes, and in mosome 2R, contains both subtelomeric (HetRp_DM) and telomeric humans their loss has been associated with disease52. However, these sequence, fully capturing the transition from euchromatin to hetero- sequences are typically missing from de novo assemblies in eukaryo- chromatin. This contig extends 80 kbp beyond the end of the reference tes. For example, it was not until 6 years after the initial shotgun assembly, and represents the first time the full telomeric transition assembly of D. melanogaster that the reference genome began to sequence has been identified for this chromosome. Chromosome include telomeric sequence53. arms 2L, 3R and X also have large contigs containing telomeric repeats We mapped known S. cerevisiae telomeric repeats to the MHAP extending past the end of the current reference sequence, indicating S. cerevisiae W303 assembly (Supplementary Note 8) and identi- additional regions of the reference that could be improved by the fied nine chromosomes where a single contig included an alignment de novo MHAP assembly.

628 VOLUME 33 NUMBER 6 JUNE 2015 NATURE BIOTECHNOLOGY MHAP uses Minhash to

approximateARTICLES read overlap

Figure 4 Continuity and putative GRCh38 gap closures of the CHM1 assembly. Human chromosomes are painted with assembled CHM1 contigs using the colored Chromosomes package59. Alternating shades indicate adjacent contigs, so each vertical transition from gray to black An improvementrepresents a incontig heuristic boundary or alignment effi breakpoint.ciency The left half of each chromosome shows the MHAP assembly of the SMRT data set and leads to anthe improvement right half shows the Illumina-based in accuracy assembly42. The SMRT assembly is considerably more continuous, with an average of less than 100 contigs per chromosome. Putative GRCh38 gap closures are shown as red dashes next to the spanned gap position. Most SMRT closures fall near the telomeres and centromeres. The three gaps spanned by the Illumina assembly are not associated with the primary chromosomes and cannot be displayed. Tiling gaps, in white, coincide with missing assembly sequence and/or regions of uncharacterized reference sequence (i.e., long stretches of N’s) to which no contigs could be mapped. A full list of putative gap closures is given in Supplementary Table 6.

in a single alignment to a single contig (vs. 16,189 for BLASR). Of these, 16,751 were reconstructed at over 99% identity and 14,824 were reconstructed with perfect identity (vs. 12,591 for BLASR). Because repeats represent the greatest challenge to assembly, we 1 2345678910 11 12 also analyzed the completeness of D. melanogaster transposable element (TE) families. TEs in D. melanogaster represent a large fraction of annotated repeats and vary over a wide range of sizes and levels of sequence diversity49. To assess TE resolution in our assem- bly, we replicated the analysis of a recent study50 that assembled D. melanogaster from Illumina Synthetic Long Reads (Moleculo) using Celera Assembler. The SMRT assembly is hundreds of fold more continuous than the Moleculo assembly (NG50 21 Mbp vs. 70 kbp), partially owing to higher sequencing coverage (90× vs. 34×) but primarily because of longer read length and reduced sequencing bias. Of 5,425 annotated TE elements in the euchromatic arms49, 5,274

(97%) are contained in a single contig by the MHAP assembly, and 13 14 15 16 17 18 19 20 21 22 X the majority (4,984) aligned perfectly to the reference. For the highly abundant roo TE family, 97% (134 of 138) of copies were resolved, comprising at least 50% of the terminal telomeric repeat on both the !16 with 93 at 100% identity. This is in contrast to the results of McCoy left and right ends. This indicates that a majority of the 16 chromo- 50

Nature America, Inc. All rights reserved. Inc. Nature America, et al. , where only 5.2% of roo copies were resolved using Moleculo. somes were completely resolved from telomere to telomere. In the

5 For the juan family, with less than 0.01% divergence between copies, remaining cases, four chromosomes were composed of more than all 11 copies were reconstructed at perfect identity by the SMRT one contig containing the telomeres, and two chromosomes did not

© 201 assembly. We conclude that the high error rate of SMRT sequencing extend into the telomeres (or the telomeric sequence did not match does not prohibit the accurate reconstruction of TE sequences, the reference). This is a substantial improvement compared with the even from highly abundant TE families with many identical copies current S. cerevisiase W303 genome36, in which only one chromo- interspersed throughout the genome. some is spanned from end to end and only five chromosome ends have been annotated. Improved telomere assemblies In contrast to the simple telomeric repeats of S. cerevisiae and other Because SMRT sequencing generates long reads without the need for eukaryotes, D. melanogaster telomeres are composed of head-to-tail cloning or amplification, it is possible to better reconstruct the repeat- arrays of three specialized retrotransposable elements (Het-A, TART rich heterochromatic regions of eukaryotic chromosomes. This is a and Tahre) and clusters of telomere-associated sequences (TASs)53. distinct advantage compared to previous sequencing methods, for Because telomeric TEs preferentially transpose to chromosome ends, which heterochromatin sequencing was thought to be impossible they are virtually absent from the euchromatic regions53. By mapping because of cloning biases or short read lengths. As proof of principle, repeat families from RepBase54 and a recent D. melanogaster repeat we evaluated the ability of long-read sequencing to reconstruct het- study55, we identified repeat arrays characteristic of D. melanogaster erochromatic sequences in the telomeric regions of S. cerevisiae51 and telomeres (Supplementary Note 8). A total of 24 telomeric contigs are D. melanogaster38 (Supplementary Note 8). Telomeres play important present in the MHAP assembly. One contig, corresponding to chro- roles in chromosome replication in all eukaryotic genomes, and in mosome 2R, contains both subtelomeric (HetRp_DM) and telomeric humans their loss has been associated with disease52. However, these sequence, fully capturing the transition from euchromatin to hetero- sequences are typically missing from de novo assemblies in eukaryo- chromatin. This contig extends 80 kbp beyond the end of the reference tes. For example, it was not until 6 years after the initial shotgun assembly, and represents the first time the full telomeric transition assembly of D. melanogaster that the reference genome began to sequence has been identified for this chromosome. Chromosome include telomeric sequence53. arms 2L, 3R and X also have large contigs containing telomeric repeats We mapped known S. cerevisiae telomeric repeats to the MHAP extending past the end of the current reference sequence, indicating S. cerevisiae W303 assembly (Supplementary Note 8) and identi- additional regions of the reference that could be improved by the fied nine chromosomes where a single contig included an alignment de novo MHAP assembly.

628 VOLUME 33 NUMBER 6 JUNE 2015 NATURE BIOTECHNOLOGY Mash uses Minhash to approximate genome and read overlap

Ref Seq Genome Clusterings Read set overlap

Mash: fast genome and metagenome distance estimation using MinHash Ondov et al (2016) Genome Biology Bloom Filters efficiently encode sets

A Bloom Filter is a length m -vector and associated hash function(s) H(X)

H(X)

m

Space/Time Trade-offs in Hash Coding with Allowable Errors. Burton Bloom. (1970) Communications of the ACM. Bloom Filters efficiently encode sets

A Bloom Filter is a length m bit-vector and associated hash function(s) H(X)

Hash Function H(X) takes in an arbitrary element and returns an integer Ex: H( ) = 12

H(X)

m

Space/Time Trade-offs in Hash Coding with Allowable Errors. Burton Bloom. (1970) Communications of the ACM. Bloom Filters efficiently encode sets

A Bloom Filter is a length m bit-vector and associated hash function(s) H(X)

Hash Function H(X) takes in an arbitrary element and returns an integer Ex: H( ) = 12

Set elements are inserted by setting an associated bit to 1 H(X)

m

Space/Time Trade-offs in Hash Coding with Allowable Errors. Burton Bloom. (1970) Communications of the ACM. Bloom Filters efficiently encode sets

A Bloom Filter is a length m bit-vector and associated hash function(s) H(X)

Hash Function H(X) takes in an arbitrary element and returns an integer Ex: H( ) = 12

Set elements are inserted by setting an associated bit to 1 H(X)

m Set elements are looked up by querying the associated bit

Space/Time Trade-offs in Hash Coding with Allowable Errors. Burton Bloom. (1970) Communications of the ACM. Bloom Filters efficiently encode sets

A Bloom Filter is a length m bit-vector and associated hash function(s) H(X)

Hash Function H(X) takes in an arbitrary element and returns an integer Ex: H( ) = 12

Set elements are inserted by setting an associated bit to 1 H(X)

m Set elements are looked up by querying the associated bit A false positive

A Bloom Filter is probabilistic !

Space/Time Trade-offs in Hash Coding with Allowable Errors. Burton Bloom. (1970) Communications of the ACM. Bloom Filters efficiently encode sets

A Bloom Filter is a length m bit-vector and associated hash function(s) H(X)

Hash Function H(X) takes in an arbitrary element and returns an integer Ex: H( ) = 12

Set elements are inserted by setting an associated bit to 1 H(X)

m Set elements are looked up by querying the associated bit A false positive

A Bloom Filter is probabilistic data structure!

Stores arbitrary sets and supports O(1) insertion and membership testing

Space/Time Trade-offs in Hash Coding with Allowable Errors. Burton Bloom. (1970) Communications of the ACM. The “Containment” Query Input: Set of individual sequencing studies ATGGTTAGAATTAAACCCGG TGCTAATAAACCUAGTGATG

CGATAGCACAGGTAGATCC TACGTAGAGGTCATTAGCC

TACGTAGAGGTCATTAGCCG TGCTAATAAACCUAGTGATG

…. Ex: TCGA, NIH SRA

Each study contains a set of raw reads

19 The “Containment” Query Input: Set of individual sequencing studies ATGGTTAGAATTAAACCCGG TGCTAATAAACCUAGTGATG

CGATAGCACAGGTAGATCC TACGTAGAGGTCATTAGCC

TACGTAGAGGTCATTAGCCG TGCTAATAAACCUAGTGATG

…. Ex: TCGA, NIH SRA

A query of interest ATGGTTAGAATTAAACCTGGATC Each study contains a set TGCTAATAAACCUAGTGATGATG CGATAGCACAGGTAGATCCAGT of raw reads TACGTAGAGGTCATTAGCCGTAT TGCTAATAAACCTAGTGATGATT CGATAGCGTAGAGGTCATTAGC CTTGTGCTAATAAACAGGTAGA TCCGTATACGTAGAGGTCATTA CCTTGTGCTAATAAACCTAGTG θ, the definition Ex: A novel transcript of containment 19 The “Containment” Query Input: Output: Set of individual All studies whose read sets sequencing studies can cover θ fraction of the ATGGTTAGAATTAAACCCGG TGCTAATAAACCUAGTGATG query

CGATAGCACAGGTAGATCC TACGTAGAGGTCATTAGCC

TACGTAGAGGTCATTAGCCG TGCTAATAAACCUAGTGATG

…. Ex: TCGA, NIH SRA

A query of interest ATGGTTAGAATTAAACCTGGATC Each study contains a set TGCTAATAAACCUAGTGATGATG CGATAGCACAGGTAGATCCAGT of raw reads TACGTAGAGGTCATTAGCCGTAT TGCTAATAAACCTAGTGATGATT CGATAGCGTAGAGGTCATTAGC CTTGTGCTAATAAACAGGTAGA TCCGTATACGTAGAGGTCATTA CCTTGTGCTAATAAACCTAGTG θ, the definition Ex: A novel transcript of containment 19 Finding relevant experiments is the first step in many large-scale analyses.

All studies whose Output reads cover the query

20 Finding relevant experiments is the first step in many large-scale analyses.

All studies whose Output reads cover the query

Functional Enrichment Gene enrichment Analysis associated with a query

20 Finding relevant experiments is the first step in many large-scale analyses.

All studies whose Output reads cover the query

Functional Enrichment Gene enrichment Analysis associated with a query

Identify novel SNPs Variant Calling within a population

20 Finding relevant experiments is the first step in many large-scale analyses.

All studies whose Output reads cover the query

Functional Enrichment Gene enrichment Analysis associated with a query

Identify novel SNPs Variant Calling within a population

Expression levels in a Expression Estimation population

20 The Sequence Bloom Tree

Bloom filter

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008

Fast search of thousands of short-read sequencing experiments. Brad Solomon and Carl Kingsford. (2016) Nature Biotechnology. The Sequence Bloom Tree

1) Sequence reads are broken down into kmers

ATGGTTAGAATTAAA ATG TTA AAT AAA TGG TAG ATT GGT AGA TTA GTT GAA TAA

Bloom filter

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008

Fast search of thousands of short-read sequencing experiments. Brad Solomon and Carl Kingsford. (2016) Nature Biotechnology. The Sequence Bloom Tree

1) Sequence reads are broken down into kmers 2) The set of all kmers in a study are stored in a bloom filter ATGGTTAGAATTAAA ATG TTA AAT AAA TAG ATG TAA TGG TAG ATT GGT AGA TTA GTT GAA TAA

Bloom filter

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008

Fast search of thousands of short-read sequencing experiments. Brad Solomon and Carl Kingsford. (2016) Nature Biotechnology. The Sequence Bloom Tree

1) Sequence reads are broken down into kmers 2) The set of all kmers in a study are stored in a bloom filter ATGGTTAGAATTAAA ATG TTA AAT AAA TAG ATG TAA TGG TAG ATT GGT AGA TTA GTT GAA TAA

3) Nodes store the total sequence content of all Bloom filter leaves rooted at that node

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008

Fast search of thousands of short-read sequencing experiments. Brad Solomon and Carl Kingsford. (2016) Nature Biotechnology. The Sequence Bloom Tree

1) Sequence reads are broken down into kmers 2) The set of all kmers in a study are stored in a bloom filter ATGGTTAGAATTAAA ATG TTA AAT AAA TAG ATG TAA TGG TAG ATT GGT AGA TTA GTT GAA TAA

3) Nodes store the total 4) Proving that this node sequence content of all Bloom filter can’t contain the query leaves rooted at that node “prunes” the tree

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008

Fast search of thousands of short-read sequencing experiments. Brad Solomon and Carl Kingsford. (2016) Nature Biotechnology. Conceptual Construction

Bit-wise union of bloom filter produces bloom filter with total kmer content of each file

S1 S2 S3 S4 S5 S6

22 Conceptual Construction

Bit-wise union of bloom filter produces bloom filter with total kmer content of each file

S1 S2 S3 S4 S5 S6

22 Conceptual Construction

Bit-wise union of bloom filter produces bloom filter with total kmer content of each file

S1 S2 S3 S4 S5 S6

22 Conceptual Construction

Bit-wise union of bloom filter produces bloom filter with total kmer content of each file

S1 S2 S3 S4 S5 S6

22 Similarity Metrics for Bloom Filters

Hamming distance: Number of ‘substitutions’ to transform one vector to another

Hamming Distance = 4

Jaccard similarity: Intersection of 1- over union of 1-bits Intersection = 1 Union = 5

Similarity = 1/5

23 Similarity Metrics for Bloom Filters

Hamming distance: Number of ‘substitutions’ to transform one vector to another

Hamming Distance = 4

Compare these metrics by normalizing (and inverting)

Jaccard similarity: Intersection of 1-bits over union of 1-bits Intersection = 1 Union = 5

Similarity = 1/5

23 Similarity Metrics for Bloom Filters

Hamming distance: Number of ‘substitutions’ to transform one vector to another

Hamming Distance = 4

Compare these metrics by normalizing (and inverting)

Distance / Length = 4 / 10

Jaccard similarity: Intersection of 1-bits over union of 1-bits Intersection = 1 Union = 5

Similarity = 1/5

23 Similarity Metrics for Bloom Filters

Hamming distance: Number of ‘substitutions’ to transform one vector to another

Hamming Distance = 4

Compare these metrics by normalizing (and inverting)

Distance / Length = 4 / 10 (Similarity = 6/10)

Jaccard similarity: Intersection of 1-bits over union of 1-bits Intersection = 1 Union = 5

Similarity = 1/5

23 Similarity Metrics for Bloom Filters

Hamming distance: Number of ‘substitutions’ to transform one vector to another

Hamming Distance = 4

Compare these metrics by normalizing (and inverting)

Distance / Length = 4 / 10 (Similarity = 6/10)

Jaccard similarity: Intersection of 1-bits over union of 1-bits Intersection = 1 Union = 5

Similarity = 1/5

The Hamming similarity does not correlate with sequence content The Jaccard similarity correlates with sequence content 23 Similarity Metrics for Bloom Filters

Hamming distance: Number of ‘substitutions’ to transform one vector to another

Hamming Distance = 4

Compare these metrics by normalizing (and inverting)

Distance / Length = 4 / 10 (Similarity = 6/10)

Jaccard similarity: Intersection of 1-bits over union of 1-bits Intersection = 1 Union = 5

Similarity = 1/5

The Hamming similarity does not correlate with sequence content The Jaccard similarity correlates with sequence content 23 Why? Building an SBT

Top Down Streaming: Given an existing tree (which may be empty) and a new bloom filter, how can we build a tree?

24 Building an SBT

Top Down Streaming: Given an existing tree (which may be empty) and a new bloom filter, how can we build a tree?

3/4 3/5

25 Building an SBT

Top Down Streaming: Given an existing tree (which may be empty) and a new bloom filter, how can we build a tree?

26 Building an SBT

Top Down Streaming: Given an existing tree (which may be empty) and a new bloom filter, how can we build a tree?

Global Pairwise Construction: Given all bloom filters, how can we build a tree?

27 Building an SBT

Top Down Streaming: Given an existing tree (which may be empty) and a new bloom filter, how can we build a tree?

Global Pairwise Construction: Given all bloom filters, how can we build a tree?

4/5 3/5

3/4 28 Building an SBT

Top Down Streaming: Given an existing tree (which may be empty) and a new bloom filter, how can we build a tree?

Global Pairwise Construction: Given all bloom filters, how can we build a tree?

29 Building an SBT

Top Down Streaming: Given an existing tree (which may be empty) and a new bloom filter, how can we build a tree?

Global Pairwise Construction: Given all bloom filters, how can we build a tree?

30 Building an SBT

Top Down Streaming: Given an existing tree (which may be empty) and a new bloom filter, how can we build a tree?

Global Pairwise Construction: Given all bloom filters, how can we build a tree?

What are the benefits? 30 What are the weaknesses? Querying a Sequence Bloom Tree

Good approximate sequence match == lots of shared kmers:

ATGGTTAGAATTAAACCTGGTTCTGCTAATAAACCTAGTGATGAT query … kmers{

Are ≥ θ fraction of query kmers ∈ this Bloom filter?

Bloom filter

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008

31 Querying a Sequence Bloom Tree

Good approximate sequence match == lots of shared kmers:

ATGGTTAGAATTAAACCTGGTTCTGCTAATAAACCTAGTGATGAT query … kmers{

Are ≥ θ fraction of query kmers ∈ this Bloom filter?

If YES, move to children Bloom filter

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008

31 Querying a Sequence Bloom Tree

Good approximate sequence match == lots of shared kmers:

ATGGTTAGAATTAAACCTGGTTCTGCTAATAAACCTAGTGATGAT query … kmers{

Are ≥ θ fraction of query kmers ∈ this Bloom filter?

If YES, move to children Bloom filter

If NO, stop looking at this subtree (Global mismatch)

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008

31 Querying a Sequence Bloom Tree

Good approximate sequence match == lots of shared kmers:

ATGGTTAGAATTAAACCTGGTTCTGCTAATAAACCTAGTGATGAT query … kmers{

Are ≥ θ fraction of query kmers ∈ this Bloom filter?

If YES, move to children Bloom filter

If NO, stop looking at this subtree (Global mismatch)

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008

31 Querying a Sequence Bloom Tree

Good approximate sequence match == lots of shared kmers:

ATGGTTAGAATTAAACCTGGTTCTGCTAATAAACCTAGTGATGAT query … kmers{

Are ≥ θ fraction of query kmers ∈ this Bloom filter?

If YES, move to children Bloom filter

If NO, stop looking at this subtree (Global mismatch)

SRA 00001 SRA 00002 SRA 00003 SRA 00004 SRA 00005 SRA 00006 SRA 00007 SRA 00008 X X X X X X X

31 © 2016 Nature America, Inc. All rights reserved. 2 fraction of the query) and Sailfish’s definition of expressed (as estimated between SBT’s definition of present (coverage of taken as the ground truth of expression for the query transcripts. were estimates Sailfish respectively. TPM, These >100 and >500 with Low query sets were similarly chosen randomly from among transcripts abundance of >1,000 TPM in at least one experiment. The estimated Mediuman with transcripts random and 100 be to chosen was set High contained in the set of 100 experiments on which Sailfish was run. The be expressed at a higher, medium or low level in at least one experiment Low, which included transcripts of length >1,000 nt that were likely to queries were constructed using Sailfish, denoted by High, Medium and ( Sailfish ran we which on ried the entire tree, but estimated accuracy on a set of 100 random files tools to estimate expression over the entire set of experiments, we que to be expressed using Sailfish returned by SBT with those in which the query sequence was estimated To analyze the accuracy of the SBT filter, we compared the experiments Measuring the performance of SBT of factor by a SRA-BLAST of time SBTsSTAR query the overall or the data reduced to filter first the time to process the unfiltered dataset with these algorithms. Using of STAR with the performance or dataset on SRA-BLAST the filtered compared then We experiments. RNA-seq brain human and 2,652 breast of blood, consisting dataset full the filter to WeSBTs used . the first of size the than rather hits of number the of size the with to time scale processing subsequent the allows This present. are not sequences query the which in out experiments ruling by first SBTs can speed up the use of algorithms, such as STAR or SRA-BLAST, SBTs can speed up existing algorithms 5 Fig. ( experiment any in expressed not were that queries on particularly benefit, substantial provided SBT of els hierarchy the 3 Figs. million (TPM) threshold used to select the query set ( per transcripts as the a as well ‘hit’) to return in order must exist that varying sensitivity threshold ( approach achieve the same query results using SRA-BLAST and STAR. to time estimated the show bars other The thread. one and memory active in filter a single of a maximum using recorded was time per-query SBT Figure 1 A

Both false positives and false negatives can arise from a mismatch mismatch a from arise can negatives false and positives false Both NALYSIS

© 2016 Nature America, and Inc. All rights reserved. and

Estimated running times of search tools for one transcript. The The transcript. one for tools search of times running Estimated 32 Supplementary Fig. 2 Supplementary Table 4 Supplementary 4 A between SBT’s definition of present (coverage of taken as the ground truth of expression for the query transcripts. were estimates Sailfish respectively. TPM, These >100 and >500 with Low query sets were similarly chosen randomly from among transcripts abundance of >1,000 TPM in at least one experiment. The estimated Mediuman with transcripts random and 100 be to chosen was set High contained in the set of 100 experiments on which Sailfish was run. The be expressed at a higher, medium or low level in at least one experiment Low, which included transcripts of length >1,000 nt that were likely to queries were constructed using Sailfish, denoted by High, Medium and ( Sailfish ran we which on ried the entire tree, but estimated accuracy on a set of 100 random files tools to estimate expression over the entire set of experiments, we que to be expressed using Sailfish returned by SBT with those in which the query sequence was estimated To analyze the accuracy of the SBT filter, we compared the experiments Measuring the performance of SBT of factor by a SRA-BLAST of time SBTsSTAR query the overall or the data reduced to filter first the time to process the unfiltered dataset with these algorithms. Using of STAR with the performance or dataset on SRA-BLAST the filtered compared then We experiments. RNA-seq brain human and 2,652 breast of blood, consisting dataset full the filter to WeSBTs used database. the first of size the than rather hits of number the of size the with to time scale processing subsequent the allows This present. are not sequences query the which in out experiments ruling by first SBTs can speed up the use of algorithms, such as STAR or SRA-BLAST, SBTs can speed up existing algorithms 5 Fig. ( experiment any in expressed not were that queries on particularly benefit, substantial provided SBT of els hierarchy the 3 Figs. million (TPM) threshold used to select the query set ( per transcripts as the a as well ‘hit’) to return in order must exist that varying sensitivity threshold ( approach achieve the same query results using SRA-BLAST and STAR. to time estimated the show bars other The thread. one and memory active in filter a single of a maximum using recorded was time per-query SBT Figure 1 2 fraction of the query) and Sailfish’s definition of expressed (as estimated ). For approximately half of the queries, the upper lev upper the queries, the of half approximately For ).

Both false positives and false negatives can arise from a mismatch mismatch a from arise can negatives false and positives false Both NALYSIS Sequence BloomTree fast isavery © 2016 Nature America, Inc. All rights reserved. Time (min) and and solution forcontainmentqueries 10 10 10 10 10 10 10 and

1 3 4 5 6 7 2 Estimated running times of search tools for one transcript. The The transcript. one for tools search of times running Estimated Supplementary Fig. 2 A ried ried the entire tree, but estimated accuracy on a set of 100 random files tools to estimate expression over the entire set of experiments, we que to be expressed using Sailfish returned by SBT with those in which the query sequence was estimated To analyze the accuracy of the SBT filter, we compared the experiments Measuring the performance of SBT of factor by a SRA-BLAST of time SBTsSTAR query the overall or the data reduced to filter first the time to process the unfiltered dataset with these algorithms. Using of STAR with the performance or dataset on SRA-BLAST the filtered compared then We experiments. RNA-seq brain human and 2,652 breast of blood, consisting dataset full the filter to WeSBTs used database. the first of size the than rather hits of number the of size the with to time scale processing subsequent the allows This present. are not sequences query the which in out experiments ruling by first SBTs can speed up the use of algorithms, such as STAR or SRA-BLAST, SBTs can speed up existing algorithms 5 Fig. ( experiment any in expressed not were that queries on particularly benefit, substantial provided SBT of els hierarchy the 3 Figs. million (TPM) threshold used to select the query set ( per transcripts as the a as well ‘hit’) to return in order must exist that varying sensitivity threshold ( approach achieve the same query results using SRA-BLAST and STAR. to time estimated the show bars other The thread. one and memory active in filter a single of a maximum using recorded was time per-query SBT Figure 1 2 fraction of the query) and Sailfish’s definition of expressed (as estimated between SBT’s definition of present (coverage of taken as the ground truth of expression for the query transcripts. were estimates Sailfish respectively. TPM, These >100 and >500 with Low query sets were similarly chosen randomly from among transcripts abundance of >1,000 TPM in at least one experiment. The estimated Mediuman with transcripts random and 100 be to chosen was set High contained in the set of 100 experiments on which Sailfish was run. The be expressed at a higher, medium or low level in at least one experiment Low, which included transcripts of length >1,000 nt that were likely to queries were constructed using Sailfish, denoted by High, Medium and ( Sailfish ran we which on Supplementary Table 4 Supplementary 4

SBT Both false positives and false negatives can arise from a mismatch mismatch a from arise can negatives false and positives false Both NALYSIS ). For approximately half of the queries, the upper lev upper the queries, the of half approximately For ). Fig. Fig. SRA 2 3 ( and and 0 (the minimum fraction of query query of fraction minimum (the Time (min) . Because it is impractical to use existing -BLA and

2 Supplementary Fig. 6 Fig. Supplementary 10 © 2016 Nature America, Inc. All rights reserved. 10 10 10 10 10 10 Estimated running times of search tools for one transcript. The The transcript. one for tools search of times running Estimated ). Three collections of representativeof collections Three ). ). These queries were performed over ST 1 3 4 5 6 7 2 Supplementary Fig. 2 Supplementary Table 4 Supplementary 4 ).

). For approximately half of the queries, the upper lev upper the queries, the of half approximately For ). (CPU time) >2.5 >2.5 years A between SBT’s definition of present (coverage of taken as the ground truth of expression for the query transcripts. were estimates Sailfish respectively. TPM, These >100 and >500 with Low query sets were similarly chosen randomly from among transcripts abundance of >1,000 TPM in at least one experiment. The estimated Mediuman with transcripts random and 100 be to chosen was set High contained in the set of 100 experiments on which Sailfish was run. The be expressed at a higher, medium or low level in at least one experiment Low, which included transcripts of length >1,000 nt that were likely to queries were constructed using Sailfish, denoted by High, Medium and ( Sailfish ran we which on ried the entire tree, but estimated accuracy on a set of 100 random files tools to estimate expression over the entire set of experiments, we que to be expressed using Sailfish returned by SBT with those in which the query sequence was estimated To analyze the accuracy of the SBT filter, we compared the experiments Measuring the performance of SBT of factor by a SRA-BLAST of time SBTsSTAR query the overall or the data reduced to filter first the time to process the unfiltered dataset with these algorithms. Using of STAR with the performance or dataset on SRA-BLAST the filtered compared then We experiments. RNA-seq brain human and 2,652 breast of blood, consisting dataset full the filter to WeSBTs used database. the first of size the than rather hits of number the of size the with to time scale processing subsequent the allows This present. are not sequences query the which in out experiments ruling by first SBTs can speed up the use of algorithms, such as STAR or SRA-BLAST, SBTs can speed up existing algorithms 5 Fig. ( experiment any in expressed not were that queries on particularly benefit, substantial provided SBT of els hierarchy the 3 Figs. million (TPM) threshold used to select the query set ( per transcripts as the a as well ‘hit’) to return in order must exist that varying sensitivity threshold ( approach achieve the same query results using SRA-BLAST and STAR. to time estimated the show bars other The thread. one and memory active in filter a single of a maximum using recorded was time per-query SBT Figure 1 2 fraction of the query) and Sailfish’s definition of expressed (as estimated SBT ST

Fig. Fig. AR Both false positives and false negatives can arise from a mismatch mismatch a from arise can negatives false and positives false Both SRA Time (min) NALYSIS 2 3 ( 0 10 10 10 10 10 10 10 (the minimum fraction of query query of fraction minimum (the

. Because it is impractical to use existing -BLA and and 2 Supplementary Fig. 6 Fig. Supplementary 1 (1 2 3 4 5 6 7 ). Three collections of representativeof collections Three ). and

). These queries were performed over 5-thr STST Estimated running times of search tools for one transcript. The The transcript. one for tools search of times running Estimated

k- AR ead) ). Supplementary Fig. 2 SBT Supplementary Table 4 Supplementary mers over a sufficient 4 (CPU time) Fig. Fig.

). For approximately half of the queries, the upper lev upper the queries, the of half approximately For ). ST SRASRA-BLAST 2 3 ( AR 0 (the minimum fraction of query query of fraction minimum (the Supplementary Supplementary Supplementary Supplementary

. Because it is impractical to use existing -BLA ). >2 >2 days 2 Supplementary Fig. 6 Fig. Supplementary Time (min)cluster ). Three collections of representativeof collections Three ). NIH ). These queries were performed over 10 10 10 10 10 10 (1 ST 10 5-thr 1 3 4 5 6 7 ST 2 ).

k- AR (CPU time)ead) 19 mins 19 mers over a sufficient ST single single

SBT CPU k AR -mers Fig. Fig. SRA Supplementary Supplementary Supplementary Supplementary 2 3 ( ). 0

(the minimum fraction of query query of fraction minimum (the (1

. Because it is impractical to use existing -BLA 5-thr 2 - - Supplementary Fig. 6 Fig. Supplementary ST

). Three collections of representativeof collections Three ). k-

). These queries were performed over AR ead)ST mers over a sufficient and and be found with can hits true-positive of 100% queries, all of half than more In specificity. same experiments. Relaxing the on rates median the represent lines dashed rates, false-positive and and variable TPM >1,000 and >500 >100, expression estimated with queries 100 Figure 2 may open up new uses for these rich collections of data. makes it practical to search large sequencing This RNAs.repositories noncoding long or andisoforms unknown of expression the about sequences of interest, making it possible to identify, for example, Furthermore,SBTssources.different knowledge priorrequire not do that can be revealed only through the analysis of multiple data sets from uncoverto insights biological used dataandbe could ofmining these find the relevant conditions for further study. SBTs enable the efficient ficulty of not being able to test computational and hypotheses quickly or to consortia research dif same the at data and face a pace, rapid are groups collecting research centers, sequencing hospitals, Individual experiments. sequencing available from question research particular good use of the ever-growing collection of available sequencing data. ticular strain of bacteria. Fast search of this type will be essential use SBTsto make to identify metagenomic samples that are likely to contain a par from among thousands that are likely to express a given metagenomicnovel isoform collections or as well. Researchers could search scale forsequence searches,conditions not just for RNA-seq data, but for genomic and individualboth laboratories andsequencing centers supportto large- andcomputationalspeedThed. 92 take ofSBTsefficiency enablewill frame,estimatebuttimewe equivalent an searchusingSailfish would tools that can solve this scale of sequence search problem in a ( reasonable ( transcripts tissue-specific identify to results for the expression of all 214,293 known human Wetranscripts used SBT andto search usedall blood, thesebrain and breast SRA sequencing runs DISCUSSION strictest the but all for 100% was queries all across rate true-positive median the and rateat by This Sailfish. is supported by the fact that the average true-positive estimated as threshold TPM the above is expression their but results no reports SBT the which for queries outlier few a by driven marily of rates false-negative and (FPR) rates false-positive the by quantified is that disagreement two definitions are related, but not perfectly aligned, resulting in some by read These mappinginference). and an expectation-maximization ). Supplementary Fig. 8 Fig. Supplementary (CPU time) k Currently, it is difficult to access all the relevant data relating to a to relating data relevant the all access to difficult is Currently,it ST -mers Supplementary Supplementary Supplementary Supplementary

Supplementary Fig. 7 Supplementary AR ).

ADVANCE ONLINE PUBLICATION - - = 0.7 for queries that return at least one file was 96–100%, 96–100%, was file one least at return that queries for 0.7 =

(1 over averaged curve (ROC) characteristic operating Receiver

Tr ue positive 5-thr ST 0.2 0.4 0.6 0.8 1.

k- AR

ead) 0 0 and and be found with can hits true-positive of 100% queries, all of half than more In specificity. same experiments. Relaxing the on rates median the represent lines dashed rates, false-positive and and variable TPM >1,000 and >500 >100, expression estimated with queries 100 Figure 2 may open up new uses for these rich collections of data. algorithm makes it practical to search large sequencing This RNAs.repositories noncoding long or andisoforms unknown of expression the about sequences of interest, making it possible to identify, for example, Furthermore,SBTssources.different knowledge priorrequire not do that can be revealed only through the analysis of multiple data sets from uncoverto insights biological used dataandbe could ofmining these find the relevant conditions for further study. SBTs enable the efficient ficulty of not being able to test computational and hypotheses quickly or to consortia research dif same the at data and face a pace, rapid are groups collecting research centers, sequencing hospitals, Individual experiments. sequencing available from question research particular good use of the ever-growing collection of available sequencing data. ticular strain of bacteria. Fast search of this type will be essential use SBTsto make to identify metagenomic samples that are likely to contain a par from among thousands that are likely to express a given metagenomicnovel isoform collections or as well. Researchers could search scale forsequence searches,conditions not just for RNA-seq data, but for genomic and individualboth laboratories andsequencing centers supportto large- andcomputationalspeedThed. 92 take ofSBTsefficiency enablewill frame,estimatebuttimewe equivalent an searchusingSailfish would tools that can solve this scale of sequence search problem in a ( reasonable ( transcripts tissue-specific identify to results for the expression of all 214,293 known human Wetranscripts used SBT andto search usedall blood, thesebrain and breast SRA sequencing runs DISCUSSION strictest the but all for 100% was queries all across rate true-positive median the and rateat by This Sailfish. is supported by the fact that the average true-positive estimated as threshold TPM the above is expression their but results no reports SBT the which for queries outlier few a by driven marily of rates false-negative and (FPR) rates false-positive the by quantified is that disagreement two definitions are related, but not perfectly aligned, resulting in some by read These mappinginference). and an expectation-maximization k

(Online Methods). Solid lines represent mean true-positive true-positive mean represent lines Solid Methods). (Online

0

Supplementary Fig. 8 Fig. Supplementary

mers over a sufficient -mers

as high as 0.9. as high as

Currently, it is difficult to access all the relevant data relating to a to relating data relevant the all access to difficult is Currently,it Supplementary Fig. 7 Supplementary = Supplementary Supplementary ( Supplementary Supplementary 0.05 ). 1. - - Fig. Fig.

0

ADVANCE ONLINE PUBLICATION = 0.7 for queries that return at least one file was 96–100%, 96–100%, was file one least at return that queries for 0.7 = Receiver operating characteristic (ROC) curve averaged over over averaged curve (ROC) characteristic operating Receiver Tr ue positive Figure Figure 0.2 0.4 0.6 0.8 and and be found with can hits true-positive of 100% queries, all of half than more In specificity. same experiments. Relaxing the on rates median the represent lines dashed rates, false-positive and and variable TPM >1,000 and >500 >100, expression estimated with queries 100 Figure 2 may open up new uses for these rich collections of data. algorithm makes it practical to search large sequencing This RNAs.repositories noncoding long or andisoforms unknown of expression the about sequences of interest, making it possible to identify, for example, Furthermore,SBTssources.different knowledge priorrequire not do that can be revealed only through the analysis of multiple data sets from uncoverto insights biological used dataandbe could ofmining these find the relevant conditions for further study. SBTs enable the efficient ficulty of not being able to test computational and hypotheses quickly or to consortia research dif same the at data and face a pace, rapid are groups collecting research centers, sequencing hospitals, Individual experiments. sequencing available from question research particular good use of the ever-growing collection of available sequencing data. ticular strain of bacteria. Fast search of this type will be essential use SBTsto make to identify metagenomic samples that are likely to contain a par from among thousands that are likely to express a given metagenomicnovel isoform collections or as well. Researchers could search scale forsequence searches,conditions not just for RNA-seq data, but for genomic and individualboth laboratories andsequencing centers supportto large- andcomputationalspeedThed. 92 take ofSBTsefficiency enablewill frame,estimatebuttimewe equivalent an searchusingSailfish would tools that can solve this scale of sequence search problem in a ( reasonable ( transcripts tissue-specific identify to results for the expression of all 214,293 known human Wetranscripts used SBT andto search usedall blood, thesebrain and breast SRA sequencing runs DISCUSSION strictest the but all for 100% was queries all across rate true-positive median the and rateat by This Sailfish. is supported by the fact that the average true-positive estimated as threshold TPM the above is expression their but results no reports SBT the which for queries outlier few a by driven marily of rates false-negative and (FPR) rates false-positive the by quantified is that disagreement two definitions are related, but not perfectly aligned, resulting in some by read These mappinginference). and an expectation-maximization 1. ). There are presently no search or alignment or search no presentlyare There ). 2 Supplementary Fig. 8 Fig. Supplementary 0 0 ). 0.1 (Online Methods). Solid lines represent mean true-positive true-positive mean represent lines Solid Methods). (Online Currently, it is difficult to access all the relevant data relating to a to relating data relevant the all access to difficult is Currently,it 0 ). This search took 3.3 d using a single thread as high as 0.9. as high as k Supplementary Fig. 7 Supplementary 0 leads to a higher sensitivity at the cost of cost the at sensitivity a higher to leads -mers 0.9 = 2 (

0.05 ADVANCE ONLINE PUBLICATION . The observed false negatives are pri negatives false . observed The 1. = 0.7 for queries that return at least one file was 96–100%, 96–100%, was file one least at return that queries for 0.7 = Fig. Fig. F 0.1 Receiver operating characteristic (ROC) curve averaged over over averaged curve (ROC) characteristic operating Receiver Tr ue positive 0 alse positiv - - 0.2 0.4 0.6 0.8 1.

5 Figure Figure 0 0 ). There are presently no search or alignment or search no presentlyare There ). (Online Methods). Solid lines represent mean true-positive true-positive mean represent lines Solid Methods). (Online 2 0 and and be found with can hits true-positive of 100% queries, all of half than more In specificity. same experiments. Relaxing the on rates median the represent lines dashed rates, false-positive and and variable TPM >1,000 and >500 >100, expression estimated with queries 100 Figure 2 may open up new uses for these rich collections of data. algorithm makes it practical to search large sequencing This RNAs.repositories noncoding long or andisoforms unknown of expression the about sequences of interest, making it possible to identify, for example, Furthermore,SBTssources.different knowledge priorrequire not do that can be revealed only through the analysis of multiple data sets from uncoverto insights biological used dataandbe could ofmining these find the relevant conditions for further study. SBTs enable the efficient ficulty of not being able to test computational and hypotheses quickly or to consortia research dif same the at data and face a pace, rapid are groups collecting research centers, sequencing hospitals, Individual experiments. sequencing available from question research particular good use of the ever-growing collection of available sequencing data. ticular strain of bacteria. Fast search of this type will be essential use SBTsto make to identify metagenomic samples that are likely to contain a par from among thousands that are likely to express a given metagenomicnovel isoform collections or as well. Researchers could search scale forsequence searches,conditions not just for RNA-seq data, but for genomic and individualboth laboratories andsequencing centers supportto large- andcomputationalspeedThed. 92 take ofSBTsefficiency enablewill frame,estimatebuttimewe equivalent an searchusingSailfish would tools that can solve this scale of sequence search problem in a ( reasonable ( transcripts tissue-specific identify to results for the expression of all 214,293 known human Wetranscripts used SBT andto search usedall blood, thesebrain and breast SRA sequencing runs DISCUSSION strictest the but all for 100% was queries all across rate true-positive median the and rateat by This Sailfish. is supported by the fact that the average true-positive estimated as threshold TPM the above is expression their but results no reports SBT the which for queries outlier few a by driven marily of rates false-negative and (FPR) rates false-positive the by quantified is that disagreement two definitions are related, but not perfectly aligned, resulting in some by read These mappinginference). and an expectation-maximization ). as high as 0.9. as high as 0.1 0.8 Supplementary Fig. 8 Fig. Supplementary ). This search took 3.3 d using a single thread 0.20 Currently, it is difficult to access all the relevant data relating to a to relating data relevant the all access to difficult is Currently,it 0 leads to a higher sensitivity at the cost of cost the at sensitivity a higher to leads = ( e Supplementary Fig. 7 Supplementary 0.05 0.9 NATURE NATURE BIOTECHNOLOGY 0.7 1. Fig. Fig. 2 0 . The observed false negatives are pri negatives false . observed The F 0.1

0.25 Supplementary Table 5 Table Supplementary alse positiv ADVANCE ONLINE PUBLICATION Figure Figure = 0.7 for queries that return at least one file was 96–100%, 96–100%, was file one least at return that queries for 0.7 = Receiver operating characteristic (ROC) curve averaged over over averaged curve (ROC) characteristic operating Receiver ). There are presently no search or alignment or search no presentlyare There ). 0.6

5 Tr ue positive 2 0.2 0.4 0.6 0.8 1. ). 0.1 ). This search took 3.3 d using a single thread 0 0 0.5 (Online Methods). Solid lines represent mean true-positive true-positive mean represent lines Solid Methods). (Online 0.8 0 0 leads to a higher sensitivity at the cost of cost the at sensitivity a higher to leads 0.20 TPM as high as 0.9. as high as 0.30 0.9 2 e 1,000 500 10 = . The observed false negatives are pri negatives false . observed The NATURE NATURE BIOTECHNOLOGY 0.7 F ( 0.1 0 alse positiv 0.05 1. Fig. Fig. 5 0 0.25 Supplementary Table 5 Table Supplementary 0.35 0.6 Figure Figure ). There are presently no search or alignment or search no presentlyare There ). 0.8 2 0.20 ). 0.1 0.5 ). This search took 3.3 d using a single thread TPM e 0.30 0 leads to a higher sensitivity at the cost of cost the at sensitivity a higher to leads NATURE NATURE BIOTECHNOLOGY 0.7 1,000 500 10 0.9 2 0 0.25 Supplementary Table 5 Table Supplementary - - - . The observed false negatives are pri negatives false . observed The F 0.1

0.6 alse positiv 0.35 5 0.5 TPM 0.30 0.8 0.20 1,000 500 10 e 0 NATURE NATURE BIOTECHNOLOGY 0.7 - - - 0.35

0.25 Supplementary Table 5 Table Supplementary 0.6 0.5 TPM 0.30 1,000 500 10 - - -

0 0.35 - - - bioRxiv preprint first posted online Dec. 20, 2018; doi: http://dx.doi.org/10.1101/501726. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. Baker and Langmead Page 13 of 18

Figure 4 Relationship between maximum leading zero count (Max LZC) and set size for three randomly-generated sets of 8-bit numbers. The Max LZC roughly estimates the log2 of the set size, though with high variance; here, two of the three estimates are o↵ by 2-fold.

The HyperLogLog Sketch

HLL Cardinality Estimate Input items Register 000 ��⛺�� 01001 10001 2 ����� 2 ~ 2 ...... 10101 10110 ����� 00100

Register 001

hash Take 00100 10110 prefix 01011 10101 3 ~ 23 00010 01011 11111 11110 ...... 11110011 Register 010 1111001111110011 11110011 ...... 001 01001 Register 011 ...... 110 00001 ... p q ...... Hash values Register 111

Dashing:Figure Fast 5and Schematic Accurate Genomic of HyperLogLog Distances with sketch HyperLogLog.Inputitemsarehashedandthehashvalueis Danieldivided Baker and into Ben prefixLangmeadp and (2018) su bioRxivx q. p is used as an index into an array of registers. A register contains the maximum leading zero count among all suxes q that mapped there. This in turn can be used to estimate the cardinality of each register. Register-level estimates are then averaged (by harmonic mean) and corrected to obtain an overall cardinality estimate. Though hash values shown here are 8 bits, Dashing uses a 64-bit hash. While the figure shows 8 registers, a typical Dashing sketch will have on the order of thousands to millions of registers. bioRxiv preprint first posted online Dec. 20, 2018; doi: http://dx.doi.org/10.1101/501726. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. Baker and Langmead Page 13 of 18

Figure 4 Relationship between maximum leading zero count (Max LZC) and set size for three randomly-generated sets of 8-bit numbers. The Max LZC roughly estimates the log2 of the set size, though with high variance; here, two of the three estimates are o↵ by 2-fold.

The HyperLogLog Sketch

HLL Cardinality Estimate Input items Register 000 ��⛺�� 01001 10001 2 ����� 2 ~ 2 ...... 10101 10110 ����� 00100

Register 001

hash Take 00100 10110 prefix 01011 10101 3 ~ 23 00010 01011 11111 11110 ...... 11110011 Register 010 1111001111110011 11110011 ...... 001 01001 Register 011 ...... 110 00001 ... p q ...... Hash values Register 111 1) Input is hashed into prefix and suffix

Dashing:Figure Fast 5and Schematic Accurate Genomic of HyperLogLog Distances with sketch HyperLogLog.Inputitemsarehashedandthehashvalueis Danieldivided Baker and into Ben prefixLangmeadp and (2018) su bioRxivx q. p is used as an index into an array of registers. A register contains the maximum leading zero count among all suxes q that mapped there. This in turn can be used to estimate the cardinality of each register. Register-level estimates are then averaged (by harmonic mean) and corrected to obtain an overall cardinality estimate. Though hash values shown here are 8 bits, Dashing uses a 64-bit hash. While the figure shows 8 registers, a typical Dashing sketch will have on the order of thousands to millions of registers. bioRxiv preprint first posted online Dec. 20, 2018; doi: http://dx.doi.org/10.1101/501726. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. Baker and Langmead Page 13 of 18

Figure 4 Relationship between maximum leading zero count (Max LZC) and set size for three randomly-generated sets of 8-bit numbers. The Max LZC roughly estimates the log2 of the set size, though with high variance; here, two of the three estimates are o↵ by 2-fold.

The HyperLogLog Sketch

2) Prefix is used to bucket hash values HLL Cardinality Estimate Input items Register 000 ��⛺�� 01001 10001 2 ����� 2 ~ 2 ...... 10101 10110 ����� 00100

Register 001

hash Take 00100 10110 prefix 01011 10101 3 ~ 23 00010 01011 11111 11110 ...... 11110011 Register 010 1111001111110011 11110011 ...... 001 01001 Register 011 ...... 110 00001 ... p q ...... Hash values Register 111 1) Input is hashed into prefix and suffix

Dashing:Figure Fast 5and Schematic Accurate Genomic of HyperLogLog Distances with sketch HyperLogLog.Inputitemsarehashedandthehashvalueis Danieldivided Baker and into Ben prefixLangmeadp and (2018) su bioRxivx q. p is used as an index into an array of registers. A register contains the maximum leading zero count among all suxes q that mapped there. This in turn can be used to estimate the cardinality of each register. Register-level estimates are then averaged (by harmonic mean) and corrected to obtain an overall cardinality estimate. Though hash values shown here are 8 bits, Dashing uses a 64-bit hash. While the figure shows 8 registers, a typical Dashing sketch will have on the order of thousands to millions of registers. bioRxiv preprint first posted online Dec. 20, 2018; doi: http://dx.doi.org/10.1101/501726. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. Baker and Langmead Page 13 of 18

Figure 4 Relationship between maximum leading zero count (Max LZC) and set size for three randomly-generated sets of 8-bit numbers. The Max LZC roughly estimates the log2 of the set size, though with high variance; here, two of the three estimates are o↵ by 2-fold.

The HyperLogLog Sketch

2) Prefix is used to bucket hash values HLL Cardinality Estimate Input items Register 000 ��⛺�� 01001 10001 2 ����� 2 ~ 2 ...... 10101 10110 ����� 00100

Register 001

hash Take 00100 10110 prefix 01011 10101 3 ~ 23 00010 01011 11111 11110 ...... 11110011 Register 010 1111001111110011 11110011 ...... 001 01001 Register 011 ...... 110 00001 ... p q ...... Hash values Register 111 1) Input is hashed into prefix and suffix 3) The maximum leading zero is recorded. Dashing:Figure Fast 5and Schematic Accurate Genomic of HyperLogLog Distances with sketch HyperLogLog.Inputitemsarehashedandthehashvalueis Cardinality is estimated based on this value Danieldivided Baker and into Ben prefixLangmeadp and (2018) su bioRxivx q. p is used as an index into an array of registers. A register contains the maximum leading zero count among all suxes q that mapped there. This in turn can be used to estimate the cardinality of each register. Register-level estimates are then averaged (by harmonic mean) and corrected to obtain an overall cardinality estimate. Though hash values shown here are 8 bits, Dashing uses a 64-bit hash. While the figure shows 8 registers, a typical Dashing sketch will have on the order of thousands to millions of registers. bioRxiv preprint first posted online Dec. 20, 2018; doi: http://dx.doi.org/10.1101/501726. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. Dashing demonstrates Bakerdistance and Langmead difficulties for divergent data Page 5 of 18

Estimation error for computing Jaccard

J = 0.111 J = 0.0465 ● ● ● 1.00 1.00 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.75 0.75 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.50 ● 0.50 ● ● ● ● ● ● abs error ● abs error ● ● ● ● Given an order of magnitude ● ● Δ ● ● ● ● Δ ● ● ● 0.25 ● 0.25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● difference in data set size ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 ● ● ● ● ● ● ● ● ●● ● ● 0.00 ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 12 14 16 ●18 20 22 10 12 14 16 ● 18 20● ● 22 ● Set 2 log2(sketch size) ● log2(sketch size) Set sizes (log2)

● 14, 17 Set 1 ● ● ● ● ● 0.06 ● ● 0.06 ● ● 17, 20 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20, 23 ● ● ● ● 0.04 ● ● 0.04 ● ● 23, 26 ● ● ● ● ● ● ● 26, 29 abs error abs error ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Δ Δ ● ● 0.02 ● 0.02 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Δ abs error ● ● ● ● ●● ● ● ● ● ● ● ● ● 0.00 ● ● ● 0.00 ● ● ● ● ● ● ● ● ● Bloom − HLL 10 12 14 16 18 20 22 10 12 14 16 18 20 22 ● ● Bloom+ − HLL log2(sketch size)● log2(sketch size) ● MinHash − HLL 0.001 0.001 ●

HyperLogLog outperforms other ● ●

● ● ● 0.000 ● ● ● ● ● 0.000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● sketch methods (most of the time) ● ● ● ● ● ● ● ● ● ● ● ● −0.001 ● ● −0.001 abs error abs error ● Δ Δ ● ● −0.002 −0.002 ●

10 12 14 16 18 20 22 10 12 14 16 18 20 22 log2(sketch size) log2(sketch size)

Figure 1 Jaccard-coecient estimation error for di↵erent-size sets using HLL, Bloom filters andThe MinHash. sketchLeft column size shows and a set of experimentsunderlying with various similarity set sizes but with can the true Jaccard-coecient fixed at 0.111. Right shows the same but for a true coecient of 0.0465. All affect accuracy3 !34 pairs of input sets di↵er in size by a factor of 2 =8.Thesecondandthirdrowszoomfurtherin with respect to the y-axis so as to highlight the relative performance of MinHash (2nd row) and the Bloom+ filter (3rd row).

For HLL, we repeated the experiment for three HLL cardinality estimation meth- ods: Flajolet’s canonical method using harmonic mean [16], and two maximum- likelihood-based methods (MLE and JMLE) proposed by Ertl [20]. We selected 400 pairs of bacterial genomes from RefSeq [26] covering a range of Jaccard- coecient values. To select the pairs, we first used dashing dist with s = 16, k = 31 and the MLE estimation method on the full set of 87,113 complete RefSeq assemblies (latest versions). We then selected a subset such that we kept 4 distinct genome pairs per Jaccard-coecient percentile. In this way, we started out with an even spread of Jaccard-coecient values, though some unevenness emerges later due to di↵erences between data structures and di↵erent selections of k. Of the genomes included in these pairs, the maximum, minimum and mean lengths were 11.7 Mbp, 308 Kbp, and 4.00 Mbp respectively. The Main Take-away:

• “Big Data” in genomics makes conventional analysis difficult

• Methods like Rail-RNA and recount2 try to improve efficiency through bulk analysis

• Sketch techniques trade accuracy for speed — and can often improve both

• Minhash and HyperLogLog provide rapid similarity approximations

• Bloom Filters provide efficient set lookup

!35 The Main Take-away:

• “Big Data” in genomics makes conventional analysis difficult

• Methods like Rail-RNA and recount2 try to improve efficiency through bulk analysis

• Sketch techniques trade accuracy for speed — and can often improve both

• Minhash and HyperLogLog provide rapid similarity approximations

• Bloom Filters provide efficient set lookup Questions? !35 Class Projects

• Keep working on them!

• Feel free to come see me if you need help or want advice

!36