Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale
Total Page:16
File Type:pdf, Size:1020Kb
SCALABLE PARALLEL METHODS FOR ANALYZING METAGENOMICS DATA AT EXTREME SCALE By JEFFREY ALAN DAILY A dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY WASHINGTON STATE UNIVERSITY School of Electrical Engineering and Computer Science MAY 2015 c Copyright by JEFFREY ALAN DAILY, 2015 All rights reserved c Copyright by JEFFREY ALAN DAILY, 2015 All rights reserved To the Faculty of Washington State University: The members of the Committee appointed to examine the dissertation of JEFFREY ALAN DAILY find it satisfactory and recommend that it be accepted. Ananth Kalyanaraman, Ph.D., Chair John Miller, Ph.D. Carl Hauser, Ph.D. Sriram Krishnamoorthy, Ph.D. ii ACKNOWLEDGEMENT First, I would like to thank Dr. Ananth Kalyanaraman for his supervision and support throughout the work. I would like to thank Dr. Abhinav Vishnu who encouraged my pursuit of my Ph.D. before anyone else, suggested Dr. Kalyanaraman as my advisor, and has been a valuable mentor and friend. I would like to thank Dr. Sriram Krishnamoorthy for his role on my graduate committee, for his research support at our employer, Pacific Northwest Northwest Laboratory (PNNL), and for his mentoring role while performing my research. I would like to thank Dr. John Miller and Dr. Carl Hauser for being on my graduate committee. Last, but not least, I would like to thank my employer, PNNL, for the tuition reimbursement benefit that assisted my pursuit of this degree. iii SCALABLE PARALLEL METHODS FOR ANALYZING METAGENOMICS DATA AT EXTREME SCALE Abstract by Jeffrey Alan Daily, Ph.D. Washington State University May 2015 Chair: Ananth Kalyanaraman, Ph.D. The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that the DNA and protein sequence repositories are being bombarded with new sequence information. Databases are continuing to report a Moores law-like growth trajectory in their database sizes, roughly doubling every 18 months. In what seems to be a paradigm-shift, individual projects are now capable of generating billions of raw sequence data that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequencing homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or “homologous”) on the basis of alignment criteria. While there are optimal alignment iv algorithms to compute pairwise homology, their deployment for large-scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implemen- tation for conducting optimal homology detection on distributed memory supercomput- ers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ∼75-100% on up to 8K cores, representing a time-to-solution of 33 seconds. We extend this work with a detailed analysis of single-node sequence alignment performance using the latest CPU vector instruction set extensions. Preliminary results re- veal that current sequence alignment algorithms are unable to fully utilize widening vector registers. v TABLE OF CONTENTS Page ACKNOWLEDGEMENTS . iii ABSTRACT . v LIST OF TABLES . ix LIST OF FIGURES . x CHAPTER 1. INTRODUCTION . 1 1.1 Quantifying Scaling Requirements . 4 1.2 Contributions . 5 2. BACKGROUND AND RELATED WORK . 7 2.1 Algorithms and Data Structures for Homology Detection . 7 2.1.1 Notation . 7 2.1.2 Sequence Alignment . 8 2.1.3 Sequence Homology . 10 2.1.4 String Indices for Exact Matching . 11 2.2 Parallelization of Homology Detection . 14 2.2.1 Vectorization Opportunities in Sequence Alignment . 15 2.2.2 Distributed Alignments . 18 3. SCALABLE PARALLEL METHODS . 20 3.1 Filtering Sequence Pairs . 20 vi 3.1.1 Suffix Tree Filter . 20 3.1.2 Length Based Cutoff Filter . 23 3.1.3 Storing Large Numbers of Tasks . 24 3.1.4 Storing Large Sequence Datasets . 24 3.2 Load Imbalance . 25 3.2.1 Load Imbalance Caused by Filters . 25 3.2.2 Load Imbalance in Homology Detection . 26 3.2.3 Solutions to Load Imbalance . 27 3.3 Implementation . 29 4. RESULTS AND DISCUSSION . 31 4.1 Compute Resources . 31 4.2 Datasets . 31 4.3 Length-Based Filter . 32 4.4 Suffix Tree Filter . 33 4.4.1 Suffix Tree Heuristic Parameters . 34 4.4.2 Distributed Datasets . 36 4.4.3 Strong Scaling . 36 5. EXTENSIONS . 42 5.1 Vectorized Sequence Alignments using Prefix Scan . 42 5.1.1 Algorithmic Comparison of Striped and Scan . 44 5.1.2 Workload Characterization . 47 5.1.3 Empirical Characterization . 48 5.1.3.1 Cache Analysis . 51 5.1.3.2 Instruction Mix Analysis . 52 5.1.3.3 Query Length versus Performance . 54 vii 5.1.3.4 Query Length versus Number of Striped Corrections . 56 5.1.3.5 Scoring Criteria Analysis . 57 5.1.3.6 Prescriptive Solutions on Choice of Algorithm . 58 5.2 Communication-Avoiding Filtering using Tiling . 60 6. CONCLUSIONS AND FUTURE WORK . 64 viii LIST OF TABLES Page 2.1 Enhanced Suffix Array for Input String s = mississippi$. 14 3.1 Notation used in the dissertation. 21 5.1 Relative performance of each vector implementation approach. For this study, we used a small but representative protein sequence dataset and aligned every protein to each other protein. The vector implementations used the SSE4.1 ISA, splitting the 128-bit vector register into 8 16-bit in- tegers. The codes were run using a single thread of execution on a Haswell CPU running at 2.3 Ghz. 44 5.2 Cache analysis of all-to-all sequence alignment for the Bacteria 2K dataset on Haswell. Scan and Striped are using 4, 8, and 16 Lanes. For both scan and striped, instruction and data miss rates for all cache levels were less than 1 percent and are therefore not shown. The primary difference between approaches is indicated by the instruction and data references. 51 5.3 Cache analysis of all-to-all sequence alignment for the Bacteria 2K dataset on Xeon Phi. Scan and Striped are using only 16 lanes because the KNC ISA only supports 32-bit integers which restricts this analysis to 16 lanes. 52 5.4 Decision table showing which algorithm should be used given a particular class of sequence alignment and query length. 58 ix LIST OF FIGURES Page 2.1 Input string s = mississippi$ and corresponding suffix tree Γ. 13 2.2 Known ways to vectorize Smith-Waterman alignments using vectors with four elements. The tables shown here represent a query sequence of length 18 against a database sequence of length 6. Alignment tables are shown with colored elements, indicating the most recently computed cells. In or- der of most recently computed to least recently, the order is green, yellow, orange, and red. Dark gray cells were computed more than four vector epochs ago. Light gray cells indicate padded cells, which are required to properly align the computation(s) but are otherwise ignored or discarded. The blue lines indicate the relevant portion of the tables with the table ori- gin in the upper left corner. (Blocked) Vectors run parallel to the the query sequence. Each vector may need to recompute until values converge. First described by Rognes and Seeberg (Rognes and Seeberg, 2000). (Diagonal) Vectors run parallel to the anti-diagonal. Fist described by Wozniak (Woz- niak, 1997). (Striped) Vectors run parallel to the query using a striped data layout. A column may need to be recomputed at most P − 1 times until the values converge. First described by Farrar (Farrar, 2007). (Scan) This is the approach taken in this dissertation. It is similar to (Striped) but requires exactly two iterations over a column. 16 3.1 Characterization of time spent in alignment operations for an all-against-all alignment of a 15K sequence dataset from the CAMERA database. 27 x 3.2 Schematic illustration of the execution on a compute node. One thread is reserved to facilitate the transfer of tasks. Tasks are stolen only from the shared portion of the deque and delivered only to the private portion. The set of input sequences typically fits entirely within a compute node and is shared by all worker threads. The task pool starts having only subtree processing (pair generation) tasks but as subtree tasks are processed they add pair alignment tasks to the pool, as well. The dynamic creation and stealing of tasks causes the tasks to become unordered. 30 4.1 Execution times for the 80K dataset using the dynamic load balancing strategies of work stealing (‘Brute’) and distributed task counters (‘Counter’). Additionally, a length-based filter is applied to each strategy. Work steal- ing iterators are not considered as they performed similarly to the original work stealing approach. 33 4.2 Execution times for suffix tree filter and best non-tree filter strategy for the 80K sequence dataset. 35 4.3 Execution times for suffix tree filter for the 80K sequence dataset when it is replicated on each node or distributed across nodes.