Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons
Total Page:16
File Type:pdf, Size:1020Kb
Communication-Efficient Jaccard similarity for High-Performance Distributed Genome Comparisons Maciej Besta1y, Raghavendra Kanakagiri5y, Harun Mustafa1;3;4, Mikhail Karasikov1;3;4, Gunnar Ratsch¨ 1;3;4, Torsten Hoefler1, Edgar Solomonik2 1Department of Computer Science, ETH Zurich 2Department of Computer Science, University of Illinois at Urbana-Champaign 3University Hospital Zurich, Biomedical Informatics Research 4SIB Swiss Institute of Bioinformatics 5Department of Computer Science, Indian Institute of Technology Tirupati yBoth authors contributed equally to this work. Abstract—The Jaccard similarity index is an important mea- sequencing data sets), machine learning (clustering, object sure of the overlap of two sets, widely used in machine learning, recognition) [8], information retrieval (plagiarism detection), computational genomics, information retrieval, and many other and many others [82], [33], [65], [68], [70], [92], [80]. areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the We focus on the application of the Jaccard similarity index Jaccard similarity among pairs of large datasets. Our algo- and distance to genetic distances between high-throughput rithm provides an efficient encoding of this problem into a DNA sequencing samples, a problem of high importance multiplication of sparse matrices. Both the encoding and sparse for different areas of computational biology [71], [92], [29]. matrix product are performed in a way that minimizes data Genetic distances are frequently used to approximate the movement in terms of communication and synchronization costs. We apply our algorithm to obtain similarity among all pairs of evolutionary distances between the species or populations a set of large samples of genomes. This task is a key part of represented in sequence sets in a computationally intractable modern metagenomics analysis and an evergrowing need due to manner [71], [63], [92], [29]. This enables or facilitates analyz- the increasing availability of high-throughput DNA sequencing ing the evolutionary history of populations and species [63], data. The resulting scheme is the first to enable accurate the investigation of the biodiversity of a sample, as well as Jaccard distance derivations for massive datasets, using large- scale distributed-memory systems. We package our routines in other applications [92], [64]. However, the enormous sizes a tool, called GenomeAtScale, that combines the proposed algo- of today’s high-throughput sequencing datasets often make rithm with tools for processing input sequences. Our evaluation it infeasible to compute the exact values of J or dJ [79]. on real data illustrates that one can use GenomeAtScale to Recent works (such as Mash [63]) propose approximations, effectively employ tens of thousands of processors to reach new for example using the MinHash representation of J(A; B), frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more which is the primary locality-sensitive hashing (LSH) scheme general underlying SimilarityAtScale algorithm may be used for used for genetic comparisons [57]. Yet, these approximations high-performance distributed similarity computations in other often lead to inaccurate approximations of dJ for highly data analytics application domains. similar pairs of sequence sets, and tend to be ineffective Index Terms—Distributed Jaccard Distance, Distributed Jac- for computation of a distance between highly dissimilar sets card similarity, Genome Sequence Distance, Metagenome Se- quence Distance, High-Performance Genome Processing, k-Mers, unless very large sketch sizes are used [63], as noted by Matrix-Matrix Multiplication, Cyclops Tensor Framework the Mash authors [63]. Thus, developing a scalable, high- Code and data: https://github.com/cyclops-community/jaccard-ctf performance, and accurate scheme for computing the Jaccard arXiv:1911.04200v3 [cs.CE] 11 Nov 2020 similarity index is an open problem of considerable relevance I. INTRODUCTION for genomics computations and numerous other applications The concept of similarity has gained much attention in in general data analytics. different areas of data analytics [81]. A popular method of Yet, there is little research on high-performance distributed assessing a similarity of two entities is based on computing derivations of either J or dJ . Existing works target very small their Jaccard similarity index J(A; B) [49], [59]. J(A; B) is datasets (e.g., 16MB of raw input text [31]), only provide a statistic that assesses the overlap of two sets A and B. simulated evaluation for experimental hardware [56], focus It is defined as the ratio between the size of the intersec- on inefficient MapReduce solutions [26], [6], [86] that need tion of A and B and the size of the union of A and B: asymptotically more communication due to using the allreduce J(A; B) = jA \ Bj=jA [ Bj. A closely related notion, the collective communication pattern over reducers [47], are lim- Jaccard distance dJ = 1 − J, measures the dissimilarity of ited to a single server [36], [66], [69], use an approach based sets. The general nature and simplicity of the Jaccard similarity on deriving the Cartesian product, which may require quadratic and distance has allowed for their wide use in numerous do- space (infeasible for the large input datasets considered in mains, for example computational genomics (comparing DNA this work) [48], or do not target parallelism or distribution at all [7]. Most works target novel use cases of J and dJ [20], are publicly available in order to enable interpretability and [22], [83], [53], or mathematical foundations of these mea- reproducibility, and to facilitate its integration into current and sures [54], [62]. We attribute this to two factors. First, many future genomics analysis pipelines. domains (for example genomics) only recently discovered the II. JACCARD MEASURES:DEFINITIONS AND IMPORTANCE usefulness of Jaccard measures [92]. Second, the rapid growth of the availability of high-throughput sequencing data and We start with defining the Jaccard measures and with its increasing relevance to genetic analysis have meant that discussing their applications in different domains. The most previous complex genetic analysis methods are falling out of important symbols used in this work are listed in Table I. favor due to their poor scaling properties [92]. To the best of While we focus on high-performance computations of dis- our knowledge, no work provides a scalable high-performance tances between genome sequences, our design and implemen- solution for computing J or dJ . tation are generic and applicable to any other use case. In this work, we design and implement SimilarityAtScale: the first algorithm for distributed, fast, and scalable derivation General Jaccard measures: of the Jaccard Similarity index and Jaccard distance. We fol- J(X; Y ) The Jaccard similarity index of sets X and Y . dJ (X; Y ) The Jaccard distance between sets X and Y ; dJ = 1 − J. low [44], devising an algorithm based on an algebraic formula- Algebraic Jaccard measures (SimilarityAtScale), details in III-A: tion of Jaccard similarity matrix computation as sparse matrix The number of possible data values (attributes) and data samples, m; n multiplication. Our main contribution is a communication- one sample contains zero, one, or more values (attributes). avoiding algorithm for both preprocessing the sparse matrices ⊗; Matrix-vector (MV) and vector dot products. The indicator matrix (it determines the presence of data values as well as computing Jaccard similarity in parallel via sparse A m×n in data samples), A 2 B ; B = f0; 1g. matrix multiplication. We use our algorithm as a core element The similarity and distance matrices with the values of Jaccard S; D n×n of GenomeAtScale, a tool that we develop to facilitate high- measures between all pairs of data samples; S; D 2 R . B; C Intermediate matrices used to compute S, B; C 2 n×n. performance genetic distance computation. GenomeAtScale N Genomics and metagenomics computations (GenomeAtScale): enables the first massively-parallel and provably accurate cal- k The number of single nucleotides in a genome subsequence. culations of Jaccard distances between genomes. By maintain- m; n The number of analyzed k-mers and genome data samples. ing compatibility with standard bioinformatics data formats, TABLE I: The most important symbols used in the paper. we allow for GenomeAtScale to be seamlessly integrated into existing analysis pipelines. Our performance analysis illus- A. Fundamental Definitions trates the scalability of our solutions. In addition, we bench- The Jaccard similarity index [59] is a statistic used to assess mark GenomeAtScale on both large (2,580 human RNASeq the similarity of sets. It is defined as the ratio of the set experiments) and massive (almost all public bacterial and intersection cardinality and the set union cardinality, viral whole-genome sequencing experiments) scales, and we make this data publicly available to foster high-performance jX \ Y j jX \ Y j distributed genomics research. Despite our focus on genomics J(X; Y ) = = ; jX [ Y j jXj + jY j − jX \ Y j data, the algebraic formulation and the implementation of our routines for deriving Jaccard measures are generic and can be where X and Y are sample sets. If both X and Y are directly used in other settings. empty, the index is defined as J(X; Y ) = 1. One can then Specifically, our work makes the following contributions. use J(X; Y ) to define the Jaccard distance dJ (X; Y ) = 1−J(X; Y ), which assesses the dissimilarity of sets and which • We design SimilarityAtScale, the first communication- is a proper metric (on the collection of all finite sets). efficient distributed algorithm to compute the Jaccard sim- We are interested in the computation of similarity between ilarity index and distance. all pairs of a collection of sets (in the context of genomics, this • We use our algorithm as a backend to GenomeAtScale, the collection contains samples from one or multiple genomes). first tool that enables fast, scalable, accurate, and large-scale Let X = fX1;:::;Xng be the set of data samples.