
SMITH-WATERMAN FOR MASSIVELY PARALLEL HIGH-PERFORMANCE COMPUTING ARCHITECTURES

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by

Shannon Irene Steinfadt

May 2010

Dissertation written by

Shannon Irene Steinfadt

B.A., Hiram College, 2000

M.A., Kent State University, 2003

Ph.D., Kent State University, 2010

Approved by

Dr. Johnnie W. Baker, Chair, Doctoral Dissertation Committee

Dr. Kenneth Batcher, Members, Doctoral Dissertation Committee

Dr. Paul Farrell

Dr. James Blank

Accepted by

Dr. Robert Walker, Chair, Department of Computer Science

Dr. John Stalvey, Dean, College of Arts and Sciences

TABLE OF CONTENTS

LIST OF FIGURES ...... viii

LIST OF TABLES ...... xii

Copyright ...... xiii

Dedication ...... xiv

Acknowledgements ...... xv

1 Introduction ...... 1

2 Sequence Alignment ...... 4

2.1 Background ...... 4

2.2 Pairwise Sequence Alignment ...... 5

2.3 Needleman-Wunsch ...... 9

2.4 Smith-Waterman Sequence Alignment ...... 10

2.5 Scoring ...... 13

2.6 Opportunities for Parallelization ...... 16

3 Parallel Computing Models ...... 19

3.1 Models of Parallel Computation ...... 19

3.1.1 Multiple Instruction, Multiple Data (MIMD) ...... 20

3.1.2 Single Instruction, Multiple Data (SIMD) ...... 22

3.2 Associative Computing Model ...... 23

3.2.1 Associative Functions ...... 26

4 Smith-Waterman Using Associative Massive Parallelism (SWAMP) . . . . 29

4.1 Overview ...... 29

4.2 ASC Emulation ...... 30

4.2.1 Data Setup ...... 30

4.2.2 SWAMP Algorithm Outline ...... 33

4.3 Performance Analysis ...... 35

4.3.1 Asymptotic Analysis ...... 35

4.3.2 Performance Monitor Result Analysis ...... 36

4.3.3 Predicted Performance as S1 and S2 Grow ...... 38

4.3.4 Additional Avenues of Discovery ...... 40

4.3.5 Comments on Emulation ...... 40

4.4 SWAMP with Added Traceback ...... 41

4.4.1 SWAMP with Traceback Analysis ...... 44

5 Extended Smith-Waterman Using Associative Massive Parallelism (SWAMP+) 46

5.1 Overview ...... 46

5.2 Single-to-Multiple SWAMP+ Algorithm ...... 48

5.2.1 Algorithm ...... 48

5.3 Multiple-to-Single SWAMP+ Algorithm ...... 52

5.4 Multiple-to-Multiple SWAMP+ Algorithm ...... 52

5.4.1 Algorithm ...... 53

5.4.2 Asymptotic Analysis ...... 55

5.5 Future Directions ...... 56

5.6 ClearSpeed Implementation ...... 56

6 Feasible Hardware Survey for the Associative SWAMP Implementation . . 57

6.1 Overview ...... 57

6.2 IBM Cell Processor ...... 58

6.3 Field-Programmable Gate Arrays - FPGAs ...... 59

6.4 Graphics Processing Units - GPGPUs ...... 60

6.4.1 Implementing ASC on GPGPUs ...... 63

6.5 ClearSpeed SIMD Architecture ...... 64

7 SWAMP+ Implementation on ClearSpeed Hardware ...... 69

7.1 Implementing Associative SWAMP+ on the ClearSpeed CSX . . . . . 69

7.2 ClearSpeed Running Results ...... 71

7.2.1 Parallel Matrix Computation ...... 72

7.2.2 Sequential Traceback ...... 78

7.3 Conclusions ...... 81

8 Smith-Waterman on a Distributed Memory Cluster System ...... 82

8.1 Introduction ...... 82

8.2 JumboMem ...... 84

8.3 Extreme-Scale Alignments on Clusters ...... 86

8.3.1 Experiments ...... 87

8.3.2 Results ...... 89

8.4 Conclusion ...... 92

9 Ongoing and Future Work ...... 94

9.1 Hierarchical Parallelism for Smith-Waterman Incorporating JumboMem 94

9.1.1 Within a Single Core ...... 95

9.1.2 Across Cores and Nodes ...... 95

9.2 Continuing SWAMP+ Work ...... 97

10 Conclusions ...... 99

BIBLIOGRAPHY ...... 101

Appendices ...... 106

A ASC Source Code for SWAMP ...... 107

A.1 ASC Code for SWAMP ...... 107

B ClearSpeed Code for SWAMP+ ...... 120

LIST OF FIGURES

1 An example of the sequential Smith-Waterman matrix. The dependencies of cell (3, 2) are shown with arrows. While the calculated C values for the entire matrix are given, the shaded anti-diagonal (where all i + j values are equal) shows one wavefront or logical parallel step since they can be computed concurrently. Affine gap penalties are used in this example as well as in the parallel code that produces the top alignment and other top scoring alignments. ...... 11

2 Smith-Waterman matrix with traceback and resulting alignment. . . . 13

3 A high-level view of the ASC model of parallel computation...... 25

4 Mapping the “shifted” data on to the ASC model. Every S2[$] column stores one full anti-diagonal from the original matrix. Here the number of PEs > m and the unused (idle) PEs are grayed out. When the number of PEs < m, the PEs are virtualized and one PE will process [m/# PEs] worth of work. The PE Interconnection Network is omitted for simplicity. ...... 31

5 Showing the (i + j = 4) step-by-step iteration of the m + n loop to shift S2. This loop stores each anti-diagonal in a single variable of the ASC array S2[$] so that it can be processed in parallel. ...... 32

6 Reduction in the number of operations through further parallelization of the SWAMP algorithm. ...... 37

7 Actual and predicted performance measurements using ASC's performance monitor. Predictions were obtained using linear regression and the least squares method and are shown with a dashed line. ...... 39

8 SWAMP+ Variations where k=3 in both a) and b) and k=2 in c). . . 47

9 A detail of one streaming multiprocessor (SM) is shown here. On CUDA-enabled NVIDIA hardware, a varied number of SMs exist for massively parallel processing. Each SM contains eight streaming processor (SP) cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. One example organization is the NVIDIA Tesla T10 with 30 SMs for a total of 240 SPs. ...... 61

10 The CSX 620 PCI-X Accelerator Board ...... 65

11 ClearSpeed CSX processor organization. Diagram courtesy of ClearSpeed: http://www.clearspeed.com/products/csx700/ ...... 66

12 The average number of calculation cycles over 30 runs. This graph was broken down into each subalignment. There were eight outliers in over 4500 runs, each of which was an order of magnitude larger than the cycle counts for the rest of the runs. That is what pulled the calculation cycle count averages up, as seen in the graph. It does show that the number of parallel computation steps is roughly the same, regardless of sequence size. Lower is better. ...... 74

13 With the top eight outliers removed, the error bars show the computation cycle counts in the same order of magnitude as the rest of the readings. ...... 75

14 Cell Updates Per Second (CUPS) for the matrix computation, where higher is better. ...... 77

15 The average number of traceback cycles over 30 runs. The longest alignment is the first alignment, as expected. Therefore the first traceback in all runs with 1 to 5 alignments returned has a higher cycle count than any of the subsequent alignments. ...... 79

16 Comparison of Cycle Counts for Computation and Traceback . . . . . 80

17 Across multiple nodes' main memory, JumboMem allows an entire cluster's memory to look like local memory with no additional hardware, no recompilation, and no root account access. ...... 86

18 The cell updates per second (CUPS) rate does experience some performance degradation, but not as much as if it had to page to disk. ...... 89

19 The execution time grows consistently even as JumboMem begins to use other nodes' memory. Note the logarithmic scales, since as input string size doubles, the calculations and memory requirements quadruple. ...... 91

20 A wavefront of wavefronts approach, merging a hierarchy of parallelism, first within a single core, and then across multiple cores. ...... 96

LIST OF TABLES

1 PAL Cluster Characteristics ...... 87

Copyright

This material is copyright © 2010 Shannon Irene Steinfadt.

This is dedicated to my guys, including Jim, Minky, Ike, Tyke, Spike, Thaddeus, Bandy, BB and the rest of the gang.

I include my family who made education and learning a top priority.

I also dedicate it to all of my friends and family (by blood and by kindred spirit) who have supported me throughout the years of effort.

Shannon Irene Steinfadt

March 18, 2010, Kent, Ohio

Acknowledgements

I acknowledge the help and input from my advisor, Dr. Johnnie Baker. In addition, the support from my dissertation committee, the department chair Dr. Robert Walker, and the Department of Computer Science at Kent State helped me bring this dissertation to completion.

I also acknowledge ClearSpeed for the use of their equipment necessary for my work.

And many thanks to the Performance and Architectures Laboratory (PAL) team at Los Alamos National Laboratory, especially Kevin Barker, Darren Kerbyson, and Scott Pakin, for their support, advice, and insight. The use of the PAL cluster and JumboMem made some of this work possible. My gratitude goes out to the Angel Fire / TAOS team at Los Alamos National Laboratory as well. They supported me during the last few months of intense effort.

CHAPTER 1

Introduction

The increasing growth and complexity of high-performance computing, as well as the stellar data growth in the field, stand as guideposts for this work. The march is towards increasing processor counts, each processor with an increasing number of compute cores and often associated with accelerator hardware.

The twice-yearly Top500 listing of the most powerful computers in the world stands as proof of this. With hundreds of thousands of cores, many using accelerators, massive parallelism is a defining feature of high-performance computing.

This research addresses one of the most often used tools in bioinformatics: sequence alignment. While my application focus is sequence alignment, this work is applicable to other problems in other fields. The parallel optimizations and techniques presented here for a Smith-Waterman-like sequence alignment can be applied to other algorithms that compute with a wavefront approach. A primary example is the parallel benchmark Sweep3D, a neutron transport model.

This work can also be extended to other applications, including better search engines utilizing more flexible approximate string matching.

An associative algorithm for performing quality sequence alignments more efficiently and faster is at the center of this dissertation. SWAMP (Smith-Waterman using Associative Massive Parallelism) is the parallel algorithm I developed for the massively parallel associative computing, or ASC, model. The ASC model is ideal for algorithm development for many reasons, including the fast searching capabilities and fast maximum finding utilized in this work. The theoretical speedup for the algorithm is optimal: the running time is reduced from O(mn) to O(m + n), where m and n are the lengths of the input sequences. When m = n, the running time becomes O(n) with a very small constant of two. The parallel associative model is introduced and explored in Chapter 3. The design and ASC implementation of SWAMP are covered in Chapter 4.

Using the capabilities of ASC, I have designed, implemented, and successfully tested innovative new algorithms, called SWAMP+, that increase the information returned by the alignment algorithms without decreasing the accuracy of those alignments. These algorithms are a highly sensitive parallelized approach extending traditional pairwise sequence alignment. They are useful for in-depth exploration of sequences, including research in expressed sequence tags, regulatory regions, and evolutionary relationships. These new algorithms are presented in Chapter 5.

Although the SWAMP suite of algorithms was designed for the associative computing platform, I implemented these algorithms on the ClearSpeed CSX 620 processor to obtain realistic metrics, as presented in Chapter 7. The performance for the compute-intensive matrix calculations displayed a parallel speedup of up to 96 using ClearSpeed's 96 processing elements, thus verifying the possibility of achieving the theoretical speedup mentioned above.

I explored additional parallel hardware implementations and a cluster-based approach to test the memory-intensive Smith-Waterman algorithm across multiple nodes within a cluster. This work utilizes a tool called JumboMem, covered in Chapter 8. It allowed us to run what we believe to be one of the largest instances of Smith-Waterman while storing the huge matrix of computations completely in memory. This is followed by proposed extensions to my work and my conclusions.

CHAPTER 2

Sequence Alignment

2.1 Background

Living organisms are essentially made of proteins. Proteins and nucleic acids (DNA and RNA) are the main components of the biochemical processes of life. DNA's primary purpose is to encode the information needed for the building of proteins. In a cell, nearly everything is composed of, or due to the action of, proteins. Fifty to sixty percent of the dry mass of a cell is protein. The importance of proteins, and their underlying genetic encoding in DNA, underscores the significance of their study.

To study gene function and regulation, nucleic acids or their corresponding proteins are sequenced. One of several techniques, such as shotgun sequencing, sequencing by hybridization, or gel electrophoresis, is used to read the strand [1]. Once the target protein/DNA/RNA is reassembled, the string can be used for analysis. One type of analysis is sequence alignment. It compares the new query string to already known and recorded sequences [1]. Comparing (aligning) sequences is an attempt to determine common ancestry or common functionality [2]. This analysis uses the fact that evolution is a conservative process [3]. As Crick stated, “once ‘information’ has passed into a protein it cannot get out again” [4].


This is a powerful tool, making sequence alignment the most common operation used in computational molecular biology [1].

Now that much of the actual process of sequencing is automated (e.g., the gene chips in microarrays), a huge amount of quantitative information is being generated.

As a result, the gene and protein databases such as GenBank and Swiss-Prot are nearly doubling in size each year. New databases of sequences are growing as well. In order to use sequence alignment as a sorting tool and obtain qualitative results from the exponentially growing databases, it is more important than ever to have effective, efficient sequence alignment analysis algorithms.

2.2 Pairwise Sequence Alignment

Pairwise sequence alignment is a one-to-one analysis between two sequences (strings). It takes as input a query string and a second sequence, outputting an alignment of the base pairs (characters) of both strings. A strong alignment between two sequences indicates sequence similarity. Similarity between a novel sequence and a studied sequence or gene reveals clues about the evolution, structure, and function of the novel sequence via the characterized sequence or gene. In the future, sequence alignment could be used to establish an individual's likelihood for a given disease, phenotype, trait, or medication resistance.

(strings). It takes as input a query string and a second sequence, outputting an alignment of the base pairs (characters) of both strings. A strong alignment between two sequences indicates sequence similarity. Similarity between a novel sequence and a studied sequence or gene reveals clues about the evolution, structure, and function of the novel sequence via the characterized sequence or gene. In the future, sequence alignment could be used to establish an individual’s likelihood for a given disease, phenotype, trait, or medication resistance.

The goal of sequence alignment is to align the bases (characters) between the strings. This alignment is the best estimate¹ of the actual evolutionary history of substitutions, mutations, insertions, and deletions of the bases (characters). When trying to determine common functionality or properties that have been conserved over time between two sequences, sequence alignment assumes that the two sample donors are homologous, descended from a common ancestor. Regardless of the assumption, this is still a very relevant type of analysis. For instance, sequences of homologous genes in mice and humans are 85% similar on average [5], allowing for valid comparisons.

¹Best here refers to the best alignment according to the specific evolutionary model used. This model is determined by the scoring weights of the dynamic programming alignment algorithms, discussed in the scoring section below.

An example of an “exact” alignment of two strings, S1 and S2, can consist of substitution mutations, deletion gaps, and insertion gaps, known as indels. The terms are defined with regard to transforming string S1 into string S2: a substitution is a letter in S1 being replaced by a letter of S2, a mutation is when S1i ≠ S2j, a deletion gap is when a character appears in S1 but does not appear in S2, and for an insertion gap, the letters of S2 do not exist in S1 [5]. The following example contains thirteen matches, an insertion gap of length one, a deletion gap of length two, and one mismatch.

AGCTA-CGTACACTACC

AGCTATCGTAC--TAGC

There are exact and approximate algorithms for sequence alignment. Exact algorithms are guaranteed to find the highest scoring alignment. The two most well known are Needleman-Wunsch [6] and Smith-Waterman [7]. Proposed in 1970, the Needleman-Wunsch algorithm [6] attempts to globally align one entire sequence against another using dynamic programming. A variation by Smith and Waterman allows for local alignment [7]. A minor adjustment by Gotoh [8] greatly improved the running time from O(m²n) to O(mn), where m and n are the sizes of the sequences being compared. It is this algorithm that is often referred to as the Smith-Waterman algorithm [9] [10] [11].

Both compare two sequences against each other. If the two strings are of size m and n respectively, then the running time is proportional to the product of their sizes, or O(mn). When the two strings are of equal size, the resulting algorithm can be considered an O(n²) algorithm.

These dynamic programming algorithms are rigorous in that they will always find the single best alignment. The drawback to these powerful methods is that they are time consuming and that they only return a single result. In this context, heuristic algorithms have gained popularity for performing local sequence alignment quickly while revealing multiple regions of local similarity. Approximate algorithms include BLAST [12], Gapped BLAST [13], and FASTA [14]. Empirically, BLAST is 10-50 times faster than the Smith-Waterman algorithm [15].

The approximate algorithms were designed for speed because of the exact algorithms' high running time. The trade-off for speed is a loss of accuracy or sensitivity through a pruning of the search space. While the heuristic methods are valuable, they may fail to report hits or report false positives that the Smith-Waterman algorithm would not. Thus, there may be higher scoring subsequences that can be aligned but are missed due to the nature of the approximations.

Oftentimes a heuristic approach can be used as a sorting tool, finding a small number of sequences of interest out of the thousands or millions that reside in a database. Then an exact algorithm can be applied to the small number of key sequences for in-depth, rigorous alignment. As a result, parallel exact sequence alignment algorithms with a reasonably large speedup over their sequential counterparts are highly desirable.

The high sensitivity and the fact that there are no additional constraints, such as the size and placement of gaps in an alignment (as with the approximate algorithms), make the exact algorithms useful tools. Their high running time and memory usage are the prohibitive factors in their use. This is where parallelization can be effective, especially with the dynamic programming techniques used in the Smith-Waterman algorithm. Any improvements to an exact algorithm can also be incorporated into the more complex approximation algorithms that make limited use of the Smith-Waterman algorithm, such as Gapped BLAST and FASTA.

The focus of this research is the Smith-Waterman (S-W) algorithm. Since S-W is an extension of the Needleman-Wunsch (N-W) algorithm, N-W is first described, followed by the full details of the Smith-Waterman algorithm.

2.3 Needleman-Wunsch

Needleman and Wunsch [6], along with Sellers [16], independently proposed a dynamic programming algorithm that performs a global sequence alignment between two sequences. Given two sequences S1 and S2, lists of ordered characters, a global alignment will align the entire length of both sequences.

It has a running time proportional to the product of the lengths of S1 and S2. Assuming |S1| = m and |S2| = n, the running time is O(mn), with a similar space requirement. A linear-space algorithm [17] was developed for the case where no gap-opening penalties are incurred, but this is not generally applicable. Because the original N-W algorithm did not include a gap-insertion penalty, the linear-space algorithm was relevant to that earlier algorithm. The paradigm generally followed is the use of affine gap penalties, where the cost of opening a gap incurs a fairly high penalty, while the continuation penalty for adding on to an already opened gap is small. This tends to yield alignments that have fewer, but longer, gaps rather than many small gaps. This is a better fit with the biological model of gene replication, where contiguous segments of a gene are replicated, but in a different location on its homologous gene.

N-W is a global alignment that finds an alignment with the highest number of exact substitutions (e.g., a base C in string S1 matched with a base C in string S2) over the entire length of the two strings. Think of the strings as sliding windows moving past one another, looking for the positioning of the strings that will obtain the greatest number of matches between the two. The added complexity is that gaps can be inserted into both strings in order to maximize the number of exact matches between the characters of the two strings. The focus is on aligning the entire strings S1 and S2.

2.4 Smith-Waterman Sequence Alignment

The Smith-Waterman algorithm (S-W) differs from the N-W algorithm in that it performs local sequence alignments. Local alignment does not require entire sequences to be positioned against one another. Instead, it tries to find local regions of similarity, or subsequences, aligning those highly conserved regions between the two sequences. Since it is not concerned with an alignment that stretches across the entire length of the strings, a local alignment can begin and end anywhere within the two sequences.

The Smith-Waterman [7] / Gotoh [8] algorithm is a dynamic programming algorithm that performs local sequence alignment on two strings of data, S1 and S2. The sizes of these strings are m and n, respectively, as stated previously.

The dynamic programming approach uses a table or matrix to preserve values and avoid recomputation. This method creates data dependencies among the different values. A matrix entry cannot be computed without prior computation of its north, west, and northwest neighbors, as seen in Figure 1. Equations 1-4 describe the recursive relationships between the computations.

Figure 1: An example of the sequential Smith-Waterman matrix. The dependencies of cell (3, 2) are shown with arrows. While the calculated C values for the entire matrix are given, the shaded anti-diagonal (where all i + j values are equal) shows one wavefront or logical parallel step since they can be computed concurrently. Affine gap penalties are used in this example as well as in the parallel code that produces the top alignment and other top scoring alignments.

The Smith-Waterman algorithm, and thus the SWAMP and SWAMP+ algorithms, allow for insertions and deletions of base pairs, referred to as indels. Finding the best scoring alignment with all possible indels and alignments is computationally and memory intensive, and therefore a good candidate for parallelization.

As outlined in [8], several values are computed for every possible combination of deletions (D), insertions (I), and matches (C). For a deletion with affine gap penalties, Equation 1 computes the current cell's value using the north neighbor's value for a match (Ci−1,j) minus the cost to open up a new gap, σ. The other value used from the north neighbor is Di−1,j, the cost of an already opened gap from the north. From those, the gap extension penalty (g) is subtracted.

\[ D_{i,j} = \max\left(C_{i-1,j} - \sigma,\; D_{i-1,j}\right) - g \tag{1} \]

An insertion is similar in Equation 2, using the western neighbor's match (C) and existing open gap (I) values and subtracting the cost to extend a gap.

\[ I_{i,j} = \max\left(C_{i,j-1} - \sigma,\; I_{i,j-1}\right) - g \tag{2} \]

To compute a match, where a character from both sequences is aligned, we compute values for C, where the actual base pairs (e.g., does T match G?) are compared in Equation 3.

\[ d(S1_i, S2_j) = \begin{cases} \text{match cost} & \text{if } S1_i = S2_j \\ \text{miss cost} & \text{if } S1_i \neq S2_j \end{cases} \tag{3} \]

This value is then combined with the overall score of the northwest neighbor, and the maximum of Di,j, Ii,j, Ci−1,j−1 + d(S1i, S2j), and zero becomes the new final score for that cell (Equation 4).

\[ C_{i,j} = \max \begin{cases} D_{i,j} \\ I_{i,j} \\ C_{i-1,j-1} + d(S1_i, S2_j) \\ 0 \end{cases} \tag{4} \]

Once the matrix has been fully computed, the second, distinct part of the S-W algorithm performs a traceback. Starting with the maximum value in the matrix, the algorithm backtracks based on which of the three values (C, D, or I) was used to compute the maximum final C value. The backtracking stops when a zero is reached.

Below is an example of a completed matrix in Figure 2, showing the traceback and the corresponding local alignment.

Figure 2: Smith-Waterman matrix with traceback and resulting alignment.
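As a concrete reference for Equations 1-4 and the traceback just described, the following sketch fills the D, I, and C matrices with affine gap penalties and then backtracks from the maximum cell. It is a minimal sequential illustration only, written for this discussion; the function name, the state-tracking details, and the example scoring values are illustrative and are not taken from the ASC implementation.

    # Sequential sketch of the affine-gap Smith-Waterman/Gotoh recurrences
    # (Equations 1-4) followed by the traceback described above.
    # Scoring values are illustrative only.
    def smith_waterman_affine(s1, s2, match=10, miss=-20, sigma=40, g=2):
        m, n = len(s1), len(s2)
        NEG = -10**9
        C = [[0] * (n + 1) for _ in range(m + 1)]    # best score ending at (i, j)
        D = [[NEG] * (n + 1) for _ in range(m + 1)]  # gap in S2 (deletion state)
        I = [[NEG] * (n + 1) for _ in range(m + 1)]  # gap in S1 (insertion state)
        best, best_pos = 0, (0, 0)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = max(C[i - 1][j] - sigma, D[i - 1][j]) - g      # Eq. 1
                I[i][j] = max(C[i][j - 1] - sigma, I[i][j - 1]) - g      # Eq. 2
                d = match if s1[i - 1] == s2[j - 1] else miss            # Eq. 3
                C[i][j] = max(D[i][j], I[i][j], C[i - 1][j - 1] + d, 0)  # Eq. 4
                if C[i][j] > best:
                    best, best_pos = C[i][j], (i, j)
        # Traceback: follow whichever term produced the current value, tracking
        # which of C, D, or I we are in, until a zero in C is reached.
        a1, a2 = [], []
        i, j = best_pos
        state = "C"
        while i > 0 and j > 0 and not (state == "C" and C[i][j] == 0):
            if state == "C":
                d = match if s1[i - 1] == s2[j - 1] else miss
                if C[i][j] == C[i - 1][j - 1] + d:           # match/mismatch
                    a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
                elif C[i][j] == D[i][j]:
                    state = "D"
                else:
                    state = "I"
            elif state == "D":                               # gap in S2
                a1.append(s1[i - 1]); a2.append("-")
                state = "C" if D[i][j] == C[i - 1][j] - sigma - g else "D"
                i -= 1
            else:                                            # gap in S1
                a1.append("-"); a2.append(s2[j - 1])
                state = "C" if I[i][j] == C[i][j - 1] - sigma - g else "I"
                j -= 1
        return best, "".join(reversed(a1)), "".join(reversed(a2))

    print(smith_waterman_affine("AGCTACGTACACTACC", "AGCTATCGTACTAGC"))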

2.5 Scoring

While there are an infinite number of possible alignments between two strings once gaps are introduced, the best alignment will have two characteristics that represent the biological model of the transmission of genetic material. The alignment should contain the highest number of likely substitutions and a minimum number of gap openings (where lengthening an existing gap is preferred to opening another gap). The closer the alignment is to these characteristics, the higher its score. Hence the use of affine gap penalties, where it costs more to open a gap (subtracting σ + g) than to extend a gap (subtracting g only) in Equations 1 and 2.

For the similarity scores d(S1i, S2j) in Equation 3, DNA and RNA alignments usually use direct match and mismatch scores.

One example of the scoring parameter settings [5] for DNA would be:

• match: 10

• mismatch: -20

• σ (gap insert): -40

• g (gap extend): -2

These affine gap settings help limit the number of gap openings, tending to group the gaps together by setting the gap-opening (σ) cost higher than the gap-extension (g) cost.
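For DNA, the scoring model therefore boils down to these four parameters plus the simple comparison of Equation 3. The fragment below is one way to package the example values above; the dictionary and function names are my own illustrative choices, not part of any cited implementation.

    # Illustrative packaging of the DNA scoring parameters listed above.
    DNA_SCORING = {
        "match": 10,      # d(S1i, S2j) when the bases are identical
        "mismatch": -20,  # d(S1i, S2j) when the bases differ
        "sigma": -40,     # gap-open ("gap insert") penalty
        "g": -2,          # gap-extension penalty
    }

    def d(base1, base2, scoring=DNA_SCORING):
        """Similarity score of Equation 3 for a single pair of bases."""
        return scoring["match"] if base1 == base2 else scoring["mismatch"]

    # Under Equations 1 and 2, a new gap is charged sigma once plus g for every
    # gap column, while an existing gap is charged only g per additional column.
    print(d("A", "A"), d("A", "T"))   # 10 -20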

For amino acids, the similarity scores are generally stored as a table. These scores are used to assess sequence likeness and are the most important source of prior knowledge [3]. In working with proteins for sequence alignment, the PAM and BLOSUM similarity matrices are widely used, and as [3] states:

These matrices incorporate many observations of which amino acids have replaced each other while the proteins were evolving in different species but still maintaining the same biochemical and physiological functions. They rescue us from the ignorance of having to assume that all amino acid changes are equally likely and equally harmful. Different similarity matrices are appropriate for different degrees of evolutionary divergence. Any matrix is most likely to find good matches with other sequences that have diverged from your query sequence to the extent for which the matrix is suited. Similar matrices are available, if not widely used, for DNA. The DNA matrices can incorporate knowledge about differential rates of transitions and transversions in the same way that some substitutions are judged more favorable than others in protein similarity matrices.

The PAM matrices are based on global alignments of closely related proteins, while the BLOSUM family of matrices is based on local alignments [18]. The higher the number in the PAM matrices, the more divergence, i.e., they are used for more distant relatives. The lower the number in the BLOSUM matrices, the more divergence. If the sequences are closely related, then a BLOSUM matrix with a higher number (BLOSUM 80) or a PAM matrix with a lower number (PAM 1) should be used. For aligning protein sequences (really residues), the above-mentioned substitution tables, such as PAM250 and BLOSUM62, are letter-dependent. Possible values to be used with a substitution table are 10 and 2 for σ and g, respectively [5].
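For proteins, d() in Equation 3 therefore becomes a table lookup rather than a fixed match/mismatch pair. The sketch below shows only the shape of such a lookup with a made-up two-residue toy table; the values are illustrative and are not the PAM250 or BLOSUM62 entries.

    # Letter-dependent scoring: d() becomes a substitution-table lookup.
    # TOY_TABLE is a two-residue illustrative example, not a real PAM/BLOSUM matrix.
    TOY_TABLE = {
        ("A", "A"): 4, ("A", "R"): -1,
        ("R", "A"): -1, ("R", "R"): 5,
    }

    def d_protein(res1, res2, table=TOY_TABLE):
        return table[(res1, res2)]

    print(d_protein("A", "R"))   # -1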

2.6 Opportunities for Parallelization

The sequential version of the Smith-Waterman algorithm has been adapted and significantly modified for the parallel ASC model. We call it Smith-Waterman using Associative Massive Parallelism, or SWAMP. Extensions and expansions to the associative algorithm are called SWAMP+. Part of the parallelization for SWAMP and SWAMP+ stems from the fact that the values along an anti-diagonal are independent. Their north, west, and northwest neighbors' values can be retrieved, and the anti-diagonal processed concurrently, in a wavefront approach. The term wavefront is used to describe the minor diagonals. One minor diagonal is highlighted in gray in Figure 1. The data dependencies shown in the above recursive equations limit the level of achievable parallelism, but using a wavefront approach will still speed up this useful algorithm.
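To make the wavefront idea concrete, the loop below visits the matrix by anti-diagonals rather than by rows: every cell on one anti-diagonal depends only on cells from the two previous anti-diagonals, so each inner group of cells is what a SIMD machine would execute as one parallel step. This is a plain Python sketch of the iteration order only, not the ASC code; the helper name is hypothetical.

    # Wavefront iteration order: cells with equal i + j form one anti-diagonal
    # and are mutually independent, so each yielded group is one logical
    # parallel step.
    def anti_diagonals(m, n):
        for a_d in range(2, m + n + 1):          # anti-diagonal index i + j
            cells = [(i, a_d - i)
                     for i in range(max(1, a_d - n), min(m, a_d - 1) + 1)]
            yield a_d, cells

    m, n = 4, 5
    for a_d, cells in anti_diagonals(m, n):
        # Each (i, j) here needs only (i-1, j), (i, j-1), and (i-1, j-1),
        # all of which lie on anti-diagonals already finished.
        print(a_d, cells)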

A wavefront approach implemented by Wozniak [19] on the Sun Ultra SPARC uses specialized SIMD-like video instructions. Wozniak used the SIMD registers to store the values parallel to the minor diagonal, reporting a two-fold speedup over a traditional implementation on the same machine.

Following Wozniak's example, a similar way to parallelize code is to use the Streaming SIMD Extensions (SSE) instruction set for the x86 architecture. Designed by Intel, the vector-like operations complete a single operation/instruction on a small number of values (usually four, eight, or sixteen) at a time. Many AMD and Intel chips support the various versions of SSE, and Intel has continued developing this technology with the Advanced Vector Extensions (AVX) for their modern chipsets.

Rognes and Seeberg [20] use the Intel Pentium processor with SSE's predecessor, the MMX SIMD instructions, for their implementation. The approach that developed out of [20] for ParAlign [21] [22] does not use the wavefront approach. Instead, they align the SIMD registers parallel to the query sequence, computing eight values at a time, using a pre-computed query-specific score matrix.

With the way they lay out the SIMD registers, the north-neighbor dependency could remove up to one third of the potential speedup gained from the SSE parallel “vector” calculations. To overcome this, they incorporate SWAT-like optimizations [23]. With large affine gap penalties, the northern neighbor will be zero most of the time. If this is true, the program can skip computing the value of the north neighbor, referred to as the “lazy F evaluation” by Farrar [24]. Rognes and Seeberg are able to reduce the number of calculations of Equation 1, and so speed up their algorithm, by skipping it when it is below a certain threshold. A six-fold speedup was reported in [20] using 8-way vectors via the MMX/SSE instructions and the SWAT-like extensions.

In the SSE work done by Farrar [24], a striped or strided pattern of access is used to line up the SIMD registers parallel to the query sequence. Doing so avoids any overlapping dependencies. Again incorporating the SWAT-like optimizations, [24] achieves a 2-8 times speedup over the Wozniak [19] and Rognes and Seeberg [20] SIMD implementations. The block substitution matrices and an efficient and clever inner loop, with the northern (F) conditional moved outside of that inner loop, are important optimizations. The strided memory access pattern of the sixteen 8-bit elements for processing improves the memory access time as well, contributing to the overall speedup.

These approaches take advantage of small-scale vector parallelization (8-, 16-, or 32-way parallelism). SWAMP is geared towards larger, massive SIMD parallelization. The theoretical peak speedup for the calculations is a factor of m, which is optimal. In our case we achieved a 96-fold speedup for the ClearSpeed implementation using 96 processing elements, confirming our theoretical speedup. The associative model of computation that is the basis for the SWAMP development is discussed in the next chapter.

CHAPTER 3

Parallel Computing Models

The main parallel model used to develop and extend Smith-Waterman sequence alignment is the ASsociative Computing (ASC) model [25]. The goal of this research was to develop and extend efficient parallel versions of the Smith-Waterman algorithm. This model, as well as another that was used for this research, is described in detail in this chapter.

3.1 Models of Parallel Computation

Some relevant vocabulary is defined here. Two terms of interest from Flynn's taxonomy of computer architectures are MIMD and SIMD, the two different models of parallel computing utilized in this research. A cluster of computers, classified as a multiple-instruction, multiple-data (MIMD) model, is used as a proof-of-concept to overcome memory limitations in extremely large-scale alignments. Our work using a MIMD model is discussed in Chapter 8. Our main development focus is on an extended data-parallel, single-instruction, multiple-data (SIMD) model known as ASC.


3.1.1 Multiple Instruction, Multiple Data (MIMD)

The multiple-instruction, multiple-data (MIMD) model describes the majority of parallel systems currently available, including the currently popular clusters of computers. The MIMD processors each have a full-fledged central processing unit (CPU) with its own local memory [26]. In contrast to the SIMD model, each of the MIMD processors stores and executes its own program asynchronously. The MIMD processors are connected via a network that allows them to communicate, but the network used can vary widely, ranging from Ethernet to Myrinet to InfiniBand connections between machines (cluster nodes). The communications tend to have a much looser structure than on SIMDs, going outside of a single unit. The data is moved along the network asynchronously by individual processors under the control of the individual programs they are executing. Typically, communication is handled by one of several different parallel languages that support message-passing. A very common library for this is known as the Message Passing Interface (MPI). Communication in a “SIMD-like” fashion is possible, but the data movements will be asynchronous. Parallel computations by MIMDs usually require extensive communication and frequent synchronizations unless the various tasks being executed by the processors are highly independent (i.e., the so-called “embarrassingly parallel” or “pleasingly parallel” problems). The work presented in Chapter 8 uses an AMD Opteron cluster connected via InfiniBand.

Unlike for SIMDs, the worst-case time required for message-passing is difficult or impossible to predict. Typically, the message-passing execution time for MIMD software is determined using average-case estimates, which are often determined by trial rather than by a worst-case theoretical evaluation, as is typical for SIMDs. Since the worst case for MIMD software is often very bad and rarely occurs, average-case estimates are much more useful. As a result, the communication time required for a MIMD on a particular problem can be, and usually is, significantly higher than for a SIMD. This leads to the important goal in MIMD programming (especially when message-passing is used) of minimizing the number of inter-processor communication steps required and maximizing the amount of time between processor communication steps. This is true even at the single-card acceleration level, such as when using graphics processors or GPUs.

Data-parallel programming is also an important technique for MIMD programming, but here all the tasks perform the same operation on different data and are only synchronized at various critical points. The majority of algorithms for MIMD systems are written in the Single-Program, Multiple-Data (SPMD) programming paradigm. Each processor has its own copy of the same program, executing the sections of the code specific to that processor or core on its local data. The popularity of the SPMD paradigm stems from the fact that it is quite difficult to write a large number of different programs that will be executed concurrently across different processors and still be able to cooperate on solving a single problem. Another approach, used for memory-intensive but not compute-intensive problems, is to create a virtual memory server, as is done with JumboMem, used in the work presented in Chapter 8. It uses MPI in its underlying implementation.

3.1.2 Single Instruction, Multiple Data (SIMD)

The SIMD model consists of multiple simple arithmetic processing elements called PEs. Each PE has its own local memory that it can fetch from and store to, but it does not have the ability to compile or execute a program. The compilation and execution of programs are handled by a processor called a control unit (or front end) [26]. The control unit is connected to all PEs, usually by a bus.

All active PEs execute the program instructions received from the control unit synchronously in lock-step. “In any time unit, a single operation is in the same state of execution on multiple processing units, each manipulating different data” [26, p. 79]. While the same instruction is executed at the same time in parallel by all active PEs, some PEs may be allowed to skip any particular instruction [27]. This is usually accomplished using an “if-else” branch structure, where some of the PEs execute the if instructions and the remaining PEs execute the else part. This model is ideal for problems that are “data-parallel” in nature and have at most a small number of if-else branching structures that can occur simultaneously, such as image processing and matrix operations.
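The lock-step “if-else” masking described above can be mimicked in a data-parallel style with array operations: every element is touched by the same instruction, and a mask decides which branch's result each element keeps. The numpy sketch below is only an analogy for how SIMD PEs are enabled or disabled; it is not tied to any particular SIMD hardware, and the data values are arbitrary.

    import numpy as np

    # Data-parallel analogue of SIMD if-else masking: one "instruction stream"
    # (the vectorized expression) is applied to all elements; the mask plays the
    # role of enabling some PEs for the "if" branch and the rest for the "else".
    values = np.array([3, -7, 12, 0, -2, 9])
    mask = values > 0                      # "responders" for the if-branch
    result = np.where(mask, values * 2,    # active PEs: double the value
                      0)                   # masked-off PEs: write zero instead
    print(result)                          # [ 6  0 24  0  0 18]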

Data can be broadcast to all active PEs by the control unit, and the control unit can also obtain data values from a particular PE using the connection (usually a bus) between the control unit and the PEs. Additionally, the set of PEs is connected by an interconnection network, such as a linear array, 2-D mesh, or hypercube, that provides parallel data movement between the PEs. Data is moved through this network in a synchronous parallel fashion by the PEs, which execute the instructions, including data movement, in lock-step. It is the control unit that broadcasts the instructions to the PEs. In particular, the SIMD network does not use the message-passing paradigm used by most parallel computers today. An important advantage of this is that SIMD network communication is extremely efficient, and the maximum time required for the communication can be determined by the worst-case time of the algorithm controlling that particular communication.

The remainder of this chapter is devoted to describing the extended SIMD ASC model. ASC is at the center of the algorithm design and development for this dissertation.

3.2 Associative Computing Model

The ASsociative Computing (ASC) model is an extended SIMD based on the STARAN associative SIMD computer, designed by Dr. Kenneth Batcher at Goodyear Aerospace, and its heavily Navy-utilized successor, the ASPRO.

Developed within the Department of Computer Science at Kent State University, ASC is an algorithmic model for associative computing [25] [28]. The ASC model grew out of work on the STARAN and MPP, associative processors built by Goodyear Aerospace. Although it is not currently supported in hardware, current research efforts are being made to both efficiently simulate and design a computer for this model.

As an extended SIMD model, ASC uses synchronous data-parallel programming, avoiding both multi-tasking and asynchronous point-to-point communication routing. Multi-tasking is unnecessary since only one task is executed at any time, with multiple instances of this task executed in lock-step on all active processing elements (PEs). ASC programmers, like SIMD programmers, avoid problems involving load balancing, synchronization, and dynamic task scheduling, issues that must be explicitly handled in MPI and other MIMD cluster paradigms.

Figure 3 shows a conceptual model of an ASC computer. There is a single control unit, also known as an instruction stream (IS), and multiple processing elements (PEs), each with its own local memory. The control unit and PE array are connected through a broadcast/reduction network, and the PEs are connected together through a PE data interconnection network.

As seen in Figure 3, every PE has access to data located in its own local memory. The data remains in place, and any responding (active) PEs process their local data in parallel. The reference to the word associative is related to the use of searching to locate data by content rather than by memory address. The ASC model does not employ associative memory; instead, it is an associative processor where the general cycle is to search, process, and retrieve. An overview of the model is available in [25].

Figure 3: A high-level view of the ASC model of parallel computation.

The tabular nature of the algorithm lends itself to computation using ASC due to the natural tabular structure of ASC data structures. The highly efficient communication across the PE interconnection network for the lock-step shifting of data from the north and northwest neighbors, and the fast, constant-time associative functions for searching and for finding maximums across the parallel computations, are well utilized by SWAMP and SWAMP+.

The associative operations are executed in constant time [29], due to additional hardware required by the ASC model. These operations can be performed efficiently (but less rapidly) by any SIMD-like machine, and they have been successfully adapted to run efficiently on several SIMD hardware platforms [30] [31]. SWAMP+ and other ASC algorithms can therefore be efficiently implemented on other systems that are closely related to SIMDs, including vector machines, which is why the model is used as a paradigm.

The control unit fetches and decodes program instructions and broadcasts control signals to the PEs. The PEs, under the direction of the control unit, execute these instructions using their own local data. All PEs execute instructions in a lockstep manner, with an implicit synchronization between every instruction. ASC has several relevant high-speed global operations: associative search, maximum/minimum search, and responder selection/detection. These are described in the following section.

3.2.1 Associative Functions

The functions relevant to the SWAMP algorithms are discussed below.

Associative Search

The basic operation in an ASC algorithm is the associative search. An associative search simultaneously locates all the PEs whose local data matches a given search key. Those PEs that have matching data are called responders and those with non-matching data are called non-responders. After performing a search, the algorithm can then restrict further processing to only affect the responders by disabling the non-responders (or vice versa). Performing additional searches may further refine the set of responders. Associative search is heavily utilized by SWAMP+ in selecting which PEs are active for each parallel step within every diagonal that is processed in tandem.

Maximum/Minimum Search

In addition to simple searches, where each PE compares its local data against a search key using a standard comparison operator (equal, less than, etc.), an associative computer can also perform global searches, where data from the entire PE array is combined together to determine the set of responders. The most common type of global search is the maximum/minimum search, where the responders are those PEs whose data is the maximum or minimum value across the entire PE array. The maximum value is used by SWAMP+ in every diagonal to track the highest value calculated so far. Use of the maximum search occurs frequently, once per logical parallel step, or m + n times per alignment.

Responder Selection/Detection

An associative search can result in multiple responders, and an associative algorithm can process those responders in one of three different modes: parallel, sequential, or single selection. Parallel responder processing performs the same set of operations on each responder simultaneously. Sequential responder processing selects each responder individually, allowing a different set of operations for each responder. Single responder selection (also known as pickOne) selects one, arbitrarily chosen, responder to undergo processing. In addition to multiple responders, it is also possible for an associative search to result in no responders. To handle this case, the ASC model can detect whether there were any responders to a search and perform a separate set of actions in that case (known as anyResponders). In SWAMP+, multiple responders that contain characters to be aligned are selected and processed in parallel, based on the associative searches mentioned above. Single responder selection occurs if and when there are multiple values that have the exact same maximum value when using the maximum/minimum search.
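The semantics of these operations can be emulated in ordinary array code, which is how they are illustrated below. On the ASC model they are constant-time operations supported by hardware; in this sketch they are plain O(#PEs) numpy operations, and the variable names are my own, used only to show what associative search, MAXDEX, pickOne, and anyResponders return.

    import numpy as np

    # Software emulation of the associative operations used by SWAMP+.
    pe_data = np.array([4, 9, 9, 1, 7])          # one value per PE

    # Associative search: responders are the PEs whose data matches the key.
    responders = pe_data == 9                    # [False, True, True, False, False]

    # Maximum search / MAXDEX: value and PE index of the global maximum.
    max_val = pe_data.max()                      # 9
    maxdex = int(pe_data.argmax())               # 1 (an arbitrary maximal responder)

    # Responder selection: pickOne takes one arbitrary responder (here the first);
    # anyResponders reports whether the search found any match at all.
    pick_one = int(np.flatnonzero(responders)[0]) if responders.any() else None
    any_responders = bool(responders.any())

    print(responders, max_val, maxdex, pick_one, any_responders)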

PE Interconnection Network

Most associative processors include some type of PE interconnection network to allow parallel data movement within the array. The ASC model itself does not specify any particular interconnection network and, in fact, many useful associative algorithms do not require one. Typically, associative processors implement simple networks such as 1D linear arrays or 2D meshes. These networks are simple to implement and allow data to be transferred quickly in a synchronous manner. The 1D linear array is sufficient and ideal for the explicit communication between PEs in the SWAMP+ algorithms.

CHAPTER 4

Smith-Waterman Using Associative Massive Parallelism (SWAMP)

4.1 Overview

While implementations of S-W exist for several SIMDs [1] [32] [33], clusters [34] [35], and hybrid clusters [36] [20], they do not directly correspond to the associative model used in this research. These algorithms assume architectural features that are different from those of the associative ASC model.

Before our work, there had been no development for the associative model in the bioinformatics domain. The associative features described in the previous chapter are used to speed up and extend the Smith-Waterman algorithm to produce more information by providing additional alignments. This work allows researchers and users to drill down into the sequences with an accuracy and depth of information not heretofore available for parallel Smith-Waterman sequence alignment.

Any solution that uses the ASC model to solve local sequence alignment has been dubbed Smith-Waterman using Associative Massive Parallelism (SWAMP). The SWAMP algorithm presented here is based on our earlier associative sequence alignment algorithm [37]. It has been further developed and parallelized to reduce its running time. Some of the changes from [37] to the work presented here are:


• Parallel input (usually a bottleneck in parallel machines) has been greatly reduced.

• Data initialization of the matrix has been parallelized

• Comparative analysis between the different parallel versions

• Comparative analysis between different worst-case file sizes

4.2 ASC Emulation

The initial development environment used is the ASC emulator. The parallel programming language and emulator share the name of the model, in that they too are called ASC. Both the compiler and emulator are available for download at http://www.cs.kent.edu/~parallel under the “Software” link. Throughout the SWAMP description, the required ASC convention of including [$] after the name of all parallel variables is used, as seen in Figure 4.

4.2.1 Data Setup

SWAMP retains the dynamic programming approach of [8] with a two-dimensional matrix. Instead of working on one element at a time, an entire matrix column is executed in parallel. However, it is not a direct sequential-to-parallel conversion. Due to the data dependencies, all north, west, and northwest neighbors need to be computed before a matrix element can be computed. If directly mapped onto ASC, the data dependencies would force a completely sequential execution of the algorithm.

One of the challenges this algorithm presented was to store an entire anti-diagonal, such as the one highlighted in Figure 4, as a single parallel ASC variable (column). The second challenge was to organize the north, west, and northwest neighbors to be the same uniform distance away from each location for every D, I, and C value, for uniform SIMD data movement.

Figure 4: Mapping the “shifted” data on to the ASC model. Every S2[$] column stores one full anti-diagonal from the original matrix. Here the number of PEs > m and the unused (idle) PEs are grayed out. When the number of PEs < m, the PEs are virtualized and one PE will process [m/# PEs] worth of work. The PE Interconnection Network is omitted for simplicity.

To align the values along an anti-diagonal, the data is shifted within parallel memory so that the anti-diagonals become columns. This shift allows the data-independent values along each anti-diagonal to be processed in parallel, from left to right. First, the two strings S1 and S2 are read in as input into S1[$] and tempS2[$]. The tempS2[$] values are what is shifted, via a temporary parallel variable, and copied into the parallel S2[$] array so that it is arranged in the manner shown in Figure 4. Instead of a matrix that is m × n, the new two-dimensional ASC “matrix” has the dimensions m × (m + n). There are m PEs used, each requiring (m + n) memory elements for its local copies of D, I, and C for the Smith-Waterman matrix values.

Figure 5: Showing (i + j = 4) step-by-step iteration of the m + n loop to shift S2. This loop stores each anti-diagonal in a single variable of the ASC array S2[$] so that it can be processed in parallel.

A specific example of the data shifting is shown in Figure 5. Here, the shifting in the fourth anti-diagonal from Figure 4 is shown in detail. To initialize this single column of the two-dimensional array, S2[$, 4], the temporary parallel variable shiftS2[$] acts as a stack. All active PEs replicate their copy of the 1-D shiftS2[$] variable down to their neighboring PE in a single ASC step utilizing the linear PE Interconnection Network (Step 1). Any data elements in shiftS2[$] that are out of range and have no corresponding S2 value are set to the placeholder value “-”. The remaining character of S2 that is stored in tmpS2[$] is “pushed” on top of (copied to) the first PE's value for shiftS2[$] (Step 3). Then all active PEs perform a parallel copy of shiftS2[$] into their local copy of the ASC 2-D array S2[$, 4] (Step 4).

Again, this parallel shifting of S2 aligns every anti-diagonal within the parallel memory so that an entire anti-diagonal can be concurrently computed. In addition, the shifting of S2 removes the parallel I/O bottleneck from the algorithm in [37]. This new algorithm only reads in the two strings, S1 and S2, instead of reading the entire m × (m + n) matrix in as input. From there, the setup of the matrix is done completely in parallel inside the ASC program, instead of being created sequentially outside of the ASC program as was done in the initial SWAMP development for [37].
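The end result of the shift is easiest to see as a table: one row per PE, one column per anti-diagonal. The sketch below builds one plausible version of that layout, with “-” for out-of-range positions and “@” for the border column; the exact placement of the placeholders in the real ASC code may differ slightly, and the function name is mine.

    # One plausible rendering of the "shifted" S2 layout of Figure 4: column a_d
    # holds, for each PE (matrix row i), the S2 character lying on anti-diagonal
    # a_d, so a single column is one wavefront step.
    def shift_s2(s2, m):
        n = len(s2)
        layout = [["-"] * (m + n + 1) for _ in range(m)]   # PEs 1..m
        for pe, i in enumerate(range(1, m + 1)):
            for a_d in range(m + n + 1):
                j = a_d - i                     # matrix column this PE sees now
                if j == 0:
                    layout[pe][a_d] = "@"       # border placeholder
                elif 1 <= j <= n:
                    layout[pe][a_d] = s2[j - 1]
        return layout

    for row in shift_s2("ACTG", m=5):           # hypothetical sizes: n = 4, m = 5
        print(" ".join(row))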

4.2.2 SWAMP Algorithm Outline

A quick overview of the algorithm is that the parallel initialization described in Section 4.2.1 shifts S2 throughout the matrix. The algorithm then iterates through each of the anti-diagonals to compute the matrix values of D, I, and C. As it does this, the algorithm also finds the index and the value of the local (column) maximum using the ASC MAXDEX function.

This SWAMP pseudocode is based on a working ASC language program. Since there are m + n + 1 anti-diagonals, they are numbered 0 through (m + n). The notation [$, a_d] indicates that all active PEs in a given anti-diagonal (a_d) process their array data in parallel. For review, m and n are the lengths of the two strings being aligned, without the added null character necessary for the traceback process.

Listing 4.1: SWAMP Local Alignment Algorithm

 1  Read in S1 and S2
 2  In Active PEs (those with valid data values in S1 or S2):
 3      Initialize the 2-D variables D[$], I[$], C[$] to 0
 4      Shift string S2 as described in the Emulation Data Setup section
 5      For every a_d from 1 to m + n do in parallel {
 6          if S2[$, a_d] neq "@" and S2[$, a_d] neq "-" then {
 7              Calculate score for a deletion for D[$, a_d]
 8              Calculate score for an insertion for I[$, a_d]
 9              Calculate matrix score for C[$, a_d] }
10          localMaxPE = MAXDEX(C[$, a_d])
11          if C[localMaxPE, a_d] > maxVal then {
12              maxPE = localMaxPE
13              maxVal = C[localMaxPE, a_d] } }
14  Return maxVal, maxPE

Steps 3 and 4 iterate through every anti-diagonal from zero through (m + n). Step 5 controls the iterations for the computations of D, I, and C for every anti-diagonal numbered 1 through (m + n). In reality, we start at diagonal 2; this is an optimization, since the PEs that are active for diagonals 0 and 1 will have been initialized to zero values previously. Step 6 masks off any non-responders, including the first “buffer” row and column in the matrix. Steps 7-9 are based on the recurrence relationships defined in Equations 1, 2, and 4, respectively. Step 10 uses the ASC MAXDEX function to track the value and location of the maximum value in Steps 12 and 13.
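Listing 4.1 is ASC pseudocode; as a rough cross-check of the loop structure it describes, the sketch below replays the same per-anti-diagonal computation in numpy, with each column update standing in for one lock-step parallel step and argmax standing in for MAXDEX. PE virtualization, the “@”/“-” masking details, and the traceback are omitted, and the function name and layout are my own rather than taken from the ASC code; scoring values are the illustrative DNA settings from Chapter 2.

    import numpy as np

    # Anti-diagonal ("wavefront") version of the D, I, C recurrences: one row per
    # PE, one column per anti-diagonal, so each column update is one parallel step.
    def swamp_like(s1, s2, match=10, miss=-20, sigma=40, g=2):
        m, n = len(s1), len(s2)
        NEG = -10**9
        C = np.zeros((m + 1, m + n + 1), dtype=int)      # row 0 is the border
        D = np.full((m + 1, m + n + 1), NEG, dtype=int)
        I = np.full((m + 1, m + n + 1), NEG, dtype=int)
        best_val, best_pe, best_ad = 0, 0, 0
        pes = np.arange(1, m + 1)                        # PE indices 1..m
        s1_arr = np.array(list(s1))
        for a_d in range(2, m + n + 1):
            j = a_d - pes                                # matrix column per PE
            active = (j >= 1) & (j <= n)                 # "responders" only
            s2_here = np.array([s2[x - 1] if 1 <= x <= n else "-" for x in j])
            d = np.where(s1_arr == s2_here, match, miss)
            D[1:, a_d] = np.maximum(C[:-1, a_d - 1] - sigma, D[:-1, a_d - 1]) - g
            I[1:, a_d] = np.maximum(C[1:, a_d - 1] - sigma, I[1:, a_d - 1]) - g
            C_new = np.maximum.reduce([D[1:, a_d], I[1:, a_d],
                                       C[:-1, a_d - 2] + d,
                                       np.zeros(m, dtype=int)])
            C[1:, a_d] = np.where(active, C_new, 0)
            local_max_pe = int(np.argmax(C[1:, a_d])) + 1      # MAXDEX analogue
            if C[local_max_pe, a_d] > best_val:
                best_val = int(C[local_max_pe, a_d])
                best_pe, best_ad = local_max_pe, a_d
        return best_val, best_pe, best_ad

    print(swamp_like("AGCTACGTACACTACC", "AGCTATCGTACTAGC"))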

4.3 Performance Analysis

4.3.1 Asymptotic Analysis

Based on an analysis of the pseudocode from Section 4.2.2, there are three loops that execute for each anti-diagonal, Θ(m + n) times, in Steps 3-5. Step 4 and each substep of Steps 7-9 require communication between PEs. The communication is with direct neighbors, at most one PE to the north. Using a linear array without wraparound, this can be done in constant time for ASC. Step 10 finds the PE index of the maximum value, or MAXDEX, in constant time as described in Section 3.2.1.

Given this analysis, the overall time complexity is Θ(m + n) using m + 1 PEs. The extra PE handles the border placeholder (the “@” in our example in Figure 4). This is asymptotically the same as the algorithm presented in [37].

4.3.2 Performance Monitor Result Analysis

Where the performance diverges is in the number of actual operations completed in the ASC emulator.

Performance is measured by using ASC's built-in performance monitor. It tracks the number of parallel and sequential operations. The only exception is that input and output operations are not counted.

Improvements to the code include the parallelization of the initial data import discussed in Section 4.2.1, moving the initialization of D, I, and C outside of a nested loop, and changes in the order of the matrix calculations for C's value when finding its maximum among D, I, and itself.

The files used in the evaluation are all very small, with most sizes of S1 and S2 equal to five. Even with the small file size, an average speedup factor of 1.08 for the parallel operations and an average speedup factor of 1.54 for sequential operations was achieved over our initial implementation. The impact of these improvements is greater as the size of the input strings grows.

To test the impact on the ASC code, several different organizations of data were explored, as seen along the x-axis in Figure 6. The type of data in the input files also impacts the overall performance. For instance, the “5x4 Mixed” file has the two strings CATTG and CTTG. This input creates the least amount of work of any of the files, partly due to its smaller size (m = 5 and n = 4) but also because not all of the characters are the same, nor do they all align with one another. The file that used the highest number of parallel operations is the “5x5 Mixed, Same Str.” file. This file has the input string CATTG twice. It had a slightly higher number of parallel operations than the two strings of AAAAA from the “5x5 Same Char, Str” file.

Figure 6: Reduction in the number of operations through further parallelization of the SWAMP algorithm.

The lower speedup factor of 1.08 for parallel operations is due to the matrix computations. This is the most compute-intensive section of the code, and no parallelization changes were made to that section of code. Its domination can be seen in Figure 6, even with these unrealistically small file sizes.

The improvement for parallelizing the setup of the parallel data (i.e. the “shift” into the 2-D ASC array) is shown in Figure 6.

What is not apparent and cannot be seen in Figure 6 is the huge reduction in parallel I/O. This is because the performance monitor is automatically suspended for

I/O operations. The m(m + n) shifted S2 data values are no longer read in. Instead,

only the character strings of S1 and S2 are input from a file. When working on actual hardware, as well as in our future work, I/O is a major bottleneck concern. This algorithm greatly reduces the parallel input from m(m + n), or O(m^2), down to O(max(m, n)).

4.3.3 Predicted Performance as S1 and S2 Grow

The level of impact of the different types of input was unexpected. After making the improvements to the algorithm and the code, performance was measured using the worst-case input: two identical strings of mixed characters. The two strings within a file were made the same length and were a subset of a GenBank entry

DQ328812 (Ursus arctos haplotype). SWAMP was tested with m and n set to lengths

3, 4, 8, 16, 32, 64, 128 and 256. We could not go beyond 256 due to the emulator

constraints.

String lengths larger than 256 are performance predictions obtained using linear

regression and the least squares method. These predictions are indicated with a

dashed line in Figure 7.

Figure 7: Actual and predicted performance measurements using ASC's performance monitor. Predictions were obtained using linear regression and the least squares method and are shown with a dashed line.

Figure 7 demonstrates that as the size of the strings increases, the growth in the number of operations is linear, matching our asymptotic analysis. Note that the y-axis scale is logarithmic since the string lengths double at each data point beyond size 4.

These predictions assume that there are |S1| or m PEs available.
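The dashed-line predictions in Figure 7 come from an ordinary least-squares linear fit. For reference, such a fit can be computed as in the C sketch below; it assumes the measured (string length, operation count) pairs are already available in arrays and is only an illustration, not the script used to produce the figure.

    #include <stddef.h>

    /* Ordinary least-squares fit y = a*x + b over n points (illustrative only). */
    void least_squares(const double *x, const double *y, size_t n,
                       double *a, double *b)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (size_t k = 0; k < n; k++) {
            sx  += x[k];
            sy  += y[k];
            sxx += x[k] * x[k];
            sxy += x[k] * y[k];
        }
        *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);  /* slope     */
        *b = (sy - *a * sx) / n;                         /* intercept */
    }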

4.3.4 Additional Avenues of Discovery

In looking at the difference in the number of operations based on the type of

input in Figure 6, it would be interesting to run a brief survey on the nature of the

input strings. Since highly similar strings are likely the most common input, further

improvements should be made to reduce the number of operations for this current

worst case. Rearranging a section of the code would not change the worst-case number of operations, but it would change how frequently the worst case occurs.

Another consideration is to combine the three main loops in the Steps 3-5 of this

algorithm. Instead of subroutine calls for the separate steps (initialization, shifting S2,

computing D, I and C), they can be combined into a single loop and the performance

measures re-run.

4.3.5 Comments on Emulation

Further parallelization helped to reduce the overall number of operations and

improve performance. The average number of parallel operations improved by a factor

of 1.08, and the sequential operations by an average factor of 1.53 with extremely small

file sizes of only 5 characters in each string. The greater impact of the speedup will be obvious when using string sizes that are several hundred or several thousand characters long.

The different tests raised awareness of the impact of the different file inputs. The difference in the number of operations for such small file sizes was unexpected. In all likelihood, the pairwise comparisons are between highly similar (biologically homologous) sequences and therefore the inputs are highly similar. This prompts further investigation of how to modify the algorithm structure to change when the worst-case number of operations occurs. It may prove beneficial to switch the worst case from happening when the input strings are highly similar to when the strings are highly dissimilar, a more unlikely data set for SWAMP.

Parallel input was greatly reduced to avoid bottlenecks and performance degradation. This is important for the migration of SWAMP to the ClearSpeed Advance

X620 board described in Chapter 6.

Overall, the algorithm and implementation are better designed and faster running than the earlier ASC alignment algorithm. In addition, this stronger algorithm makes for a better transition to the ClearSpeed and NVIDIA parallel acceleration hardware.

4.4 SWAMP with Added Traceback

The traceback section for SWAMP was later added in the emulator version of the

ASC code. A pseudocode explanation of the SWAMP algorithm is given below, with

Steps 14 and higher devoted to tracing back the alignment and outputting the actual

alignment information to the user. The “$” symbol indicates all active PEs’ values

are selected for a particular parallel variable.

Listing 4.2: SWAMP Local Alignment Algorithm with Traceback

1  Read in S1 and S2
2  In Active PEs (those with valid data values in S1 or S2):
3    Initialize the 2-D variables D[$], I[$], C[$] to zeros.
4    Shift string S2 as described in the ASC Emulation section above
5    For every a_d from 1 to m + n do in parallel {
6      if S2[$, a_d] neq "@" and S2[$, a_d] neq "-" then {
7        Calculate score for a deletion for D[$, a_d]
8        Calculate score for an insertion for I[$, a_d]
9        Calculate matrix score for C[$, a_d] }
10     localMaxPE = MAXDEX(C[$, a_d])
11     if C[localMaxPE, a_d] > maxVal then {
12       maxPE  = localMaxPE
13       maxVal = C[localMaxPE, a_d] }}
14   Start at maxVal, maxPE        // get row and col indices
15   diag   = max_col_id
16   row_id = max_id
17   Store the very last 2 characters that are aligned for output
18   While (C[$, diag] > 0) and traceback_direction != "x" {
19     if traceback_direction == "c" {
20       diag   = diag - 2;
21       row_id = row_id - 1;
22       Add S1[row_id], S2[diag - row_id] to output strings }
23     if traceback_direction == "n" {
24       diag   = diag - 1;
25       row_id = row_id - 1;
26       Add S1[row_id] and '-' to output strings }
27     if traceback_direction == "w" {
28       diag   = diag - 1;
29       row_id = row_id;
30       Add '-' and S2[diag - row_id] to output strings }}
31   Output C[row_id, diag],
32     S1[row_id], and S2[row_id, diag] }

Steps 15 and 16 use the stored values maxPE and maxVal, obtained by using

ASC’s fast maximum MAXDEX operation in Step 10.

The loop in Step 18 is predicated on the fact that the computed values are greater

than zero and there are characters remaining in the alignment to be output. The variable traceback_direction stores which of its three neighbors had the maximum computed value: its northwest or corner neighbor ("c"), the north neighbor ("n"), or the west neighbor ("w"). The directions come from the sequential Smith-Waterman representation, not the "skewed" parallel data movement of the ASC SWAMP algorithm. The sequential variables diag (for anti-diagonal) and row_id line up to form a logical row and column index into the skewed S2 associative data (Steps 23-30).

4.4.1 SWAMP with Traceback Analysis

The original SWAMP algorithm presented in Section 4.2.2 has an asymptotic running time of O(m + n) using m + 1 PEs. The newly added traceback section is inherently sequential: it starts at the right-most anti-diagonal containing the maximum computed value in the entire matrix and traces back, right to left across the matrix, until a zero value is reached. The maximum number of iterations the loop in Step 18 can complete is m + n, the width of the computed matrix. This is asymptotically no longer than the computation section, which is also a factor of m + n, or 2n when m = n. Removing the coefficient, as should be done in asymptotic notation, this 2n becomes O(n); the traceback therefore only adds to the coefficient and maintains the O(n) running time.

In SWAMP, only one subsequence alignment is found, just like in Smith-Waterman.

We discuss our adaptation for a rigorous local alignment algorithm that provides multiple local non-overlapping, non-intersecting regions of similarity in the next chapter, calling the work SWAMP+. We strive to create a parallel version along the lines of SIM [9] and LALIGN [14], rigorous algorithms that provide multiple regions of similarity but that are sequential, with slow running times similar to the sequential

Smith-Waterman.

Another ASC algorithm of special interest is an efficient pattern-matching algorithm [38]. Preliminary work shows that [16] could be a strong basis for an associative parallel version of a nucleotide search tool that uses spaced seeds to perform hit detection similar to MEGABLAST [39] and PatternHunter [40].

This full implementation of the Smith-Waterman algorithm in the ASC language using the ASC emulator is important for two reasons. The first is that it is a proof of concept that the SWAMP algorithm can be implemented and executed in a fully associative manner on the model it was designed for. This is important to the dissertation overall.

The second reason is that the code can be run to verify the correctness of the ASC code in the emulator. In addition, it has been used to validate the output from the implementations on the ClearSpeed hardware discussed in Chapter 7.

CHAPTER 5

Extended Smith-Waterman Using Associative Massive Parallelism (SWAMP+)

5.1 Overview

This chapter introduces three new extensions for exact sequence alignment algorithms on the parallel ASC model. The three extensions introduced allow for a highly sensitive parallelized approach that extends traditional pairwise sequence alignment using the Smith-Waterman algorithm and help to automate knowledge discovery.

While using several strengths of the parallel ASC model, the new extensions produce multiple outputs of local subsequence alignments between two sequences. This is the first parallel algorithm that provides multiple non-overlapping, non-intersecting subsequence alignments with the accuracy of the Smith-Waterman algorithm. The parallel alignment algorithms extend our existing Smith-Waterman using Associative

Massive Parallelism (SWAMP) algorithm [37] [41] and we dub this work SWAMP+.

The innovative approaches used in SWAMP+ quickly mask portions of the sequences that have already been aligned, as well as increase the ratio of compute to input/output time, vital for parallel efficiency and speedup when implemented on additional commercial hardware. SWAMP+ also provides a semi-automated approach for the in-depth studies that require exact pairwise alignment, allowing for a greater exploration of the two sequences being aligned. No tweaking of parameters or manual manipulation of the data is necessary to find subsequent alignments. It maintains the sensitivity of the Smith-Waterman algorithm in addition to providing multiple alignments in a manner similar to BLAST and other heuristic tools, while creating a better workflow for the users.

This section introduces three new variations for pairwise sequence alignment that allow multiple local sequence alignments between two sequences. This is not sequence comparison between three or more sequences, often referred to as "multiple sequence alignment." These variations allow for a semi-automated way to perform multiple, alternate local sequence alignments between the same two sequences without having to intervene to remove already aligned data by hand. These variations all take advantage of the masking capabilities of the ASC model.

Figure 8: SWAMP+ Variations where k=3 in both a) and b) and k=2 in c).

5.2 Single-to-Multiple SWAMP+ Algorithm

This first extension is designed to find the highest scoring local sequence alignment

between the query sequence and the “known” sequence. Once it finds the best local

subsequence between the two strings, it then repeatedly mines the second string for

additional local alignments, as shown in Figure 8a.

When running the algorithm, the output from the first alignment is identical to

SWAMP, which is the same output as Smith-Waterman. In the following k or fewer

iterations, the Single-to-Multiple alignment (s2m) will repeatedly search and output

the additional local alignments between the first, best local region in S1 with other

non-intersecting, non-overlapping regions across S2. The parameter k is input by the

user.

The following discussion references the pseudocode for the Single-to-Multiple Local Alignment, or s2m, code. The changes and additions from SWAMP have a double

star (**) in front of them.

5.2.1 Algorithm

Listing 5.1: SWAMP+ Single-to-Multiple Local Alignment Algorithm (s2m)

1  Read in S1 and S2
2  In Active PEs (those with data values for S1 or S2):
3    Initialize the 2-D variables D[$], I[$], C[$] to zeros.
4    Shift string S2
5    For every diag from 1 to m + n do in parallel {
6      Steps 4 - 9: Compute SWAMP matrix and max vals
7    Start at maxVal, maxPE        // obtain the row and col indices
8    diag   = max_col_id
9    row_id = max_id
10   Output the very last two characters that are aligned
11   While (C[$, diag] > 0) and traceback_direction != "x" {
12     if traceback_direction == "c" then {
13       diag = diag - 2; row_id = row_id - 1
14       ** S1_in_tB[row_id]      = TRUE
15       ** S2_in_tB[diag - PEid] = TRUE }
16     if traceback_direction == "n" {
17       diag = diag - 1; row_id = row_id - 1 }
18     if traceback_direction == "w" {
19       diag = diag - 1; row_id = row_id }
20     Output C[row_id, diag], S1[row_id], S2[row_id, diag] }
21   ** if S1_in_tB[$] = FALSE then { S1[$] = "Z" }
22   ** if S2_in_tB[$] = TRUE  then { S2[$] = "O" }
23   ** Go to Step 2 while # of iterations < k or
24        maxVal < δ * overall_maxVal

Algorithmically, the same steps for initialization, calculation, and traceback are performed as in the SWAMP algorithm. Steps 8 and 9 use the stored values maxPE and maxVal, obtained by using ASC's fast maximum operation (MAXDEX) in the

earlier SWAMP computation.

The loop in Step 11 is predicated on the fact that the computed values are greater

than zero and there are characters remaining in alignment to be output. As in

SWAMP, the variable traceback_direction stores which of its three neighbors had the

maximum computed value, its northwest or corner neighbor (“c”), the north neighbor

(“n”), or the west (“w”). The directions come from the sequential Smith-Waterman

representation, not the “skewed” parallel data moved for the ASC SWAMP algorithm.

The sequential variables diag (for anti-diagonal) and row_id line up to

form a logical row and column index into the skewed S2 associative data (Steps 12 -

18).

The first major change is at the traceback in Step 12. Any time two residues

are aligned, i.e. traceback_direction = "c", those characters in S1[row_id] and S2[diag − PEid] are masked as belonging to the traceback. The reason for the index

manipulation in S2 is that S2 has been turned horizontally and copied into all active

PEs. This means we need to calculate which actual character of the second string is

part of the alignment and mark it (Step 12). For instance, if the last active PE in

Figure 3 matches the “G” in S1 to the “G” in S2, we mark the string S1[5] as being

part of the alignment, and S2[diag − PEid] = S2[9−5] = S2[4] is marked as well.

After the traceback completes, Step 21 will reset parts of S1 such that any characters that are not in the initial (best) traceback will be changed to the character "Z", which does not code for any DNA base or amino acid. That essentially disables those positions from being aligned with any position in S2. A similar step is taken to disable the region that has already been matched in S2, using the character "O", since it does not encode an amino acid. The characters in S2 that have been aligned are replaced by "O"s so that other alignments with a lower score can be discovered. The character "X" has been avoided because it is commonly used as a "Don't Know" character in genomic data and we want to avoid any incidental alignments with it.
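As an illustration of this masking step, the C sketch below applies the two substitutions just described, assuming the traceback membership flags are available as plain arrays (in the ASC code they are parallel variables, one element per PE).

    /* Mask S1 and S2 after a traceback (s2m variation, illustrative only).
     * s1_in_tb / s2_in_tb flag characters that were part of the alignment. */
    void mask_after_traceback(char *s1, const int *s1_in_tb, int m,
                              char *s2, const int *s2_in_tb, int n)
    {
        for (int i = 0; i < m; i++)         /* keep only the aligned region of S1 */
            if (!s1_in_tb[i]) s1[i] = 'Z';  /* 'Z' codes for no base or amino acid */
        for (int j = 0; j < n; j++)         /* disable the already-aligned region of S2 */
            if (s2_in_tb[j])  s2[j] = 'O';  /* 'O' likewise never matches */
    }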

For the second through kth iterations of the algorithm, S1 and S2 now contain

“do not match to” characters. While S1 is directly altered in place, S2 is more

problematic, since every PE holds a slightly shifted copy of S2. The most efficient

way to handle the changes to S2 is to reinitialize the parallel array S2[$,0] through

S2[$,m + n]. The technique used for efficient initialization, discussed in detail in [41],

is to utilize the linear PE interconnection network available between the PEs in ASC

and a temporary parallel variable named shiftS2[$]. This is the basic re-initialization

of the S2[$,x] array, done for every kth run. By re-initializing, any back propagation and then forward propagation steps are avoided.

The number of additional alignments is limited by two different parameters. The

first input parameter is k, the number of local alignments sought. The second input

parameter is a maximum degradation factor, δ. If the overall maximum local alignment score degrades too much, the program can be stopped by the multiplicative δ. When δ = 0.5, the s2m loop will stop running when the subsequent new alignment score is 50% or lower than the initial (highest) alignment score. This control is imple-

mented in Step 23 to limit the number of additional alignments to those of interest

and to reduce the time by not searching for undesired alignments.
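Following the prose description above (stop after k alignments, or once the score degrades below δ times the best score), the loop-control test can be written as in the short C sketch below; the names are illustrative only.

    /* Should the driver loop search for another alignment? (illustrative only) */
    int keep_searching(int iteration, int k,
                       double last_score, double best_score, double delta)
    {
        /* continue only while both the count limit and the score cutoff hold */
        return iteration < k && last_score >= delta * best_score;
    }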

5.3 Multiple-to-Single SWAMP+ Algorithm

The Multiple-to-Single (m2s) alignment, demonstrated in Figure 8b, will repeat-

edly mine the first input sequence for multiple local alignments against the strongest

local alignment in the second string. One way to achieve this m2s output is to simply

use the Single-to-Multiple variation but swap the two input strings prior to the

initialization of the matrix values in Step 3 of the original SWAMP algorithm.

5.4 Multiple-to-Multiple SWAMP+ Algorithm

This is the most complex and interesting extension of the SWAMP algorithm. The

Multiple-to-Multiple, or m2m, will search for non-overlapping, non-intersecting local

sequence alignments, as shown in Figure 8c. Again, this is not multiple sequence

alignment with three or more sequences, but an in-depth investigative tool that does

not require hand editing the different sequences. It allows for the precision of the

Smith-Waterman algorithm, returning multiple, different pairwise alignments, similar

to the results returned by BLAST, but without the disadvantages of using a heuristic.

The changes are marked by a ** in the pseudocode. The main difference between

the s2m and the m2m is when and how the characters are masked off. First, to avoid

overlapping regions once a traceback has begun, any residues involved, even if they

are part of an indel, are marked so that they will be removed and not included in

later alignments.

The other change is in Line 21. Any values of the first string that are in an align-

ment should NOT be included in later alignments. Therefore, any characters marked

as TRUE are replaced with the “Z” non-matching character. This allows for multiple

local alignments to be discovered without intervention and data manipulation.

The goal is to allow for a form of automation for the end user while providing the

“gold-standard” of alignment quality using the Smith-Waterman approach.

5.4.1 Algorithm

Listing 5.2: SWAMP+ Multiple-to-Multiple Local Alignment Algorithm (m2m)

1  Read in S1 and S2
2  In Active PEs (those with data values for S1 or S2):
3    Initialize the 2-D variables D[$], I[$], C[$] to zeros.
4    Shift string S2
5    For every diag from 1 to m + n do in parallel {
6      Steps 4 - 9: Compute SWAMP matrix and max vals
7    Start at maxVal, maxPE        // obtain row and col indices
8    diag   = max_col_id
9    row_id = max_id
10   Output the very last two characters that are aligned
11   While (C[$, diag] > 0) and traceback_direction != "x" {
12     ** S1_in_tB[row_id]      = TRUE
13     ** S2_in_tB[diag - PEid] = TRUE
14     if traceback_direction == "c" then {
15       diag = diag - 2; row_id = row_id - 1 }
16     if traceback_direction == "n" {
17       diag = diag - 1; row_id = row_id - 1 }
18     if traceback_direction == "w" {
19       diag = diag - 1; row_id = row_id }
20     Output C[row_id, diag], S1[row_id], S2[row_id, diag]
21   ** if S1_in_tB[$] = TRUE then { S1[$] = "Z" }
22      if S2_in_tB[$] = TRUE then { S2[$] = "O" }
23   ** Go to Step 2 while # of iterations < k
24        or maxVal < δ * overall_maxVal

5.4.2 Asymptotic Analysis

The first analysis uses asymptotic computational complexity, based on the pseudocode and the actual SWAMP-with-traceback code.

As previously stated, the entire SWAMP algorithm presented in Section 4.2.2 runs

in O(m + n) steps using m + 1 PEs. A single traceback in the worst case would be

the width of the computed matrix, m + n. This is asymptotically no longer than the

computation and therefore only adds to the coefficient, maintaining an O(m + n) running time.

The variations of Single-to-Multiple, Multiple-to-Single, and Multiple-to-Multiple take the time for a single run times the number of desired sub-alignments, or k ∗ O(m + n). The size of k is limited: k can be no larger than min(m, n), because there cannot be more local alignments than residues. This worst case would only occur if every alignment is a single base long, with every other base being a match separated by an indel. This worst case results in n ∗ (m + n) work, or an O(n^2) algorithm when m = n.

This algorithm is designed for use on homologous sequences with affine gap penalties. The worst case, where every other base is a match separated by an indel, is unlikely and undesirable in biological terms. Additionally, with the δ parameter limiting the degree of score degradation, it is very unlikely that the worst case would occur, since the local alignments of homologous sequences will be longer than a single residue; otherwise this algorithm should not be applied.

5.5 Future Directions

A few slight modifications to the algorithms and implementations would include the option to allow or disallow overlap of the local alignments. This would entail reusing residues that are part of indels in the multiple-to-multiple variation. The reverse option would also be available for the single-to-multiple and multiple-to-single variations, to disallow overlapping alignments. This can be relevant for searching regulatory regions.

We would also like to combine these capabilities to repeatedly mine m2m alignments, looking for multiple sub-alignments within each non-overlapping, non-intersecting region of interest, as several biologists expressed interest in this. The idea is to run a version of m2m followed by a special partitioning where s2m is run on each of the subsequences found in the initial m2m alignment.

5.6 Clearspeed Implementation

SWAMP and SWAMP+ have been implemented on real, available hardware. We used an accelerator board from ClearSpeed. The hardware choice and rationale are discussed in the next chapter, with a full description and analysis of the ClearSpeed implementation presented in Chapter 7 and the code listing in Appendix B.

CHAPTER 6

Feasible Hardware Survey for the Associative SWAMP Implementation

6.1 Overview

Since there is no commercial associative hardware currently available, ASC algorithms must be adapted and implemented on other hardware platforms.

The idea of using other types of computing hardware for Smith-Waterman sequence alignment has been developed in recent years for several platforms, including graphics cards [42] [43] [44] [45], the IBM Cell processor [46] [47], and custom hardware such as Paracel's GeneMatcher and the Kestrel parallel processor [33]. While useful, our focus is on the massively parallel associative model and optimization for that platform.

To allow for the migration of ASC algorithms, including SWAMP, onto other computing platforms, the associative functions specific to ASC have to be implemented. In our code, emulating the associative functionality allows for practical testing with full-length sequence data. The functions are associative search, maximum search, and responder selection and detection, as discussed in detail in Section 3.2.1. Another important factor is the communication available between processing elements.

Originally presented in [48], the four parallel architectures considered for ASC emulation are the IBM Cell processor, field-programmable gate arrays (FPGAs), NVIDIA's general-purpose graphics processing units (GPGPUs), and the ClearSpeed CSX 620 accelerator. Preliminary work was completed for the Cell processor and FPGAs. A more in-depth study, with specific mappings of the associative functionality to GPGPUs and the ClearSpeed hardware, is presented.

6.2 IBM Cell Processor

Developed by IBM and used in Sony's PlayStation 3 game console, the Cell Broadband Engine is a hybrid architecture that consists of a general-purpose PowerPC processor and an array of eight synergistic processing elements (SPEs) connected together through an element interconnect bus (EIB). Cell processors are widely used, not only in gaming but as part of computation nodes in clusters and large-scale systems such as the Roadrunner hybrid-architecture supercomputer. Roadrunner was developed by Los Alamos National Lab and IBM [49] and was listed as the number one fastest computer on Top500.org in November 2008 and June 2009.

The Cell has been used successfully for several other bioinformatics algorithms, including sequence alignment [46]. It is not clear how efficient the associative mappings would be, but in light of the strong positive match between the ClearSpeed board and ASC, this emulation was not pursued.

6.3 Field-Programmable Gate Arrays - FPGAs

A field-programmable gate array or FPGA is a fabric of logic elements, each with a small amount of combinational logic and a register, that can be used to implement everything from simple circuits to complete microprocessors. While generally slower than traditional microprocessors, FPGAs are able to exploit a high degree of fine-grained parallelism.

FPGAs can be used to implement SWAMP+ in one of two ways: pure custom logic or softcore processors. With custom logic, the algorithm would be implemented directly at the hardware level using a hardware description language (HDL) such as Verilog or VHDL. This approach would result in the highest performance as it takes full advantage of the parallelism of the hardware. Other sequence alignment algorithms have been successfully implemented on FPGAs using custom logic and shown significant performance gains [50] [51]. However, a pure custom logic solution is much more difficult to design than software and tends to be highly dependent on the particular FPGA architecture used.

An alternative to pure custom logic is a hybrid approach using softcore processors. A softcore processor is a processor implemented entirely within the FPGA fabric.

Softcore processors can be programmed just like ordinary (hardcore) processors, but they can be customized with application-specific instructions. These special instructions are then implemented with custom logic that can take advantage of the highly parallel FPGA hardware. Two companies, Mitrionics and Convey, currently support using FPGAs in this capacity.

6.4 Graphics Processing Units - GPGPUs

Another hardware platform onto which the ASC model can be mapped is the graphics card. Graphics cards have been used for years not only for the graphics pipeline to create and output graphics, but for other types of general-purpose computation, including sequence alignment. The advent of increasingly powerful graphics cards that contain their own processing units, known as graphics processing units or GPUs, has led to many scientific applications being offloaded to GPUs. The use of graphics hardware for non-graphics applications has been dubbed General-Purpose computation on

Graphics Processing Units or GPGPU.

The graphics card manufacturer NVIDIA released the Compute Unified Device

Architecture (CUDA). It offers three key abstractions that give a clear parallel structure to conventional C code for one thread of the hierarchy [45].

Figure 9: A detail of one streaming multiprocessor (SM) is shown here. On CUDA-enabled NVIDIA hardware, a varied number of SMs exist for massively parallel processing. Each SM contains eight streaming processor (SP) cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. One example organization is the NVIDIA Tesla T10 with 30 SMs for a total of 240 SPs.

CUDA is a computing architecture, but it also consists of an application programming interface (API) and a software development kit (SDK). CUDA provides both a low-level API and a higher-level API. The introduction of CUDA allowed for a real break from the graphics pipeline, allowing multithreaded applications to be developed without the need for stream computing. It also removed the difficult mapping of general-purpose programs to parts of the graphics pipeline. The conceptual decoupling meant GPU programmers no longer had to refer to values as "textures" or to specifically use rasterization hardware. It also allows a level of freedom and abstraction from the hardware. One drawback with the relatively young CUDA SDK

(initial release in early 2007) is that the abstraction and optimization of code is not as fully decoupled from the hardware as one might want. This causes optimization problems that can be difficult to detect and correct.

The GPGPUs have multiple levels of parallelism and rely on massive multithreading. Each thread has its own local memory, used to express fine-grained parallelism.

Threads are organized in blocks that communicate through shared memory and are used for coarse-grained (cluster-like) parallelism [52]. Every thread is stored within a streaming processor (SP), and every SP can handle 128 threads. Eight SPs are contained within each streaming multiprocessor (SM), shown in Figure 9. While the number of SMs is scalable across the different types and generations of NVIDIA graphics cards, the underlying SM layout remains the same. This scalability is ideal as graphics cards change and are updated.

The specific compute-heavy GPGPU card with no graphics output is known as the Tesla series. The Tesla T10 has 240 SP processors that each handle 128 threads.

This means that there could be a maximum of 30,720 lightweight threads processed in parallel at one time [52]. Another CUDA-enabled card may have only 128 SPs, but it can run the same CUDA code, only slower due to less parallelism.

Their overall organization is a single-program (kernel), multiple-data or SPMD model of computing, the same classification as MPI-based cluster computing.

6.4.1 Implementing ASC on GPGPUs

Given their low cost and high availability, graphics cards and General-Purpose Graphics Processing Unit (GPGPU) programming were carefully explored. The initial development hardware was two NVIDIA Tesla C870 computing boards obtained through an equipment grant from NVIDIA. To map the ASC model onto CUDA, every PE would be mapped to a single thread. Due to the communication between PEs and the lockstep data movement common to SIMD and associative SIMD algorithms, communication between threads is necessary. This means that the threads need to be contained within the same logical thread block structure to emulate the PE Interconnection Network. Explicit synchronization and deadlock prevention is a necessary and difficult task for the programmer.

A second factor that limits an ASC algorithm to a single block is the independence requirement between blocks, where blocks can be run in any order. A thread block is limited in size to 512 threads, prematurely cutting short the level of parallelism that can be achieved on a GPGPU and effectively removing any power of scalability.

Mapping the ASC functions to CUDA is more difficult than mapping ASC to the ClearSpeed CSX chip due to the multiple layers of hierarchy and multithreading involved. Also, the onus of explicit synchronization is on the programmer to manage.

Regardless of the difficulties, a successful and efficient mapping of the associative functions onto the NVIDIA GPGPU hardware would be ideal. GPUs are affordable and massively parallel. The hardware has a low cost, many current computers and laptops already contain CUDA-enabled graphics cards, and the software tools are free. This could make the SWAMP+ suite available to millions with no additional hardware necessary. While a CUDA implementation of the Smith-Waterman algorithm is described in [44] and extended in [43], SWAMP+ differs greatly from the basic Smith-Waterman algorithm and is not directly comparable to [44] and [43].

After evaluating the feasibility of equivalent associative functions, we determined that there is no scalability for the associative features on general-purpose graphics processing units (GPGPUs). This is due to the heavy communication inherent in the associative algorithms. Therefore, we did not implement the necessary associative functionality on the GPUs, nor the SWAMP/SWAMP+ algorithms.

6.5 Clearspeed SIMD Architecture

After the exploration and evaluation of the different hardware, ClearSpeed was chosen for transitioning SWAMP+ to commercially available hardware because it is a SIMD-like accelerator. It is the most analogous to the ASC model; therefore, the associative functions were implemented for ClearSpeed's language Cn.

This accelerator board, shown in Figure 10, connects to a host computer through a PCI-X interface. The board can be used as a co-processor along with the CPU, or it can be used for the development of embedded systems that will carry the ClearSpeed processors without the board. Any algorithms developed on this board can, in theory, become part of an embedded system. Multiple boards can be connected to the same host in order to scale up the level of parallelism, as necessary for the application.

Figure 10: The CSX 620 PCI-X Accelerator Board

The ClearSpeed CSX family of processors are SIMD co-processors designed to accelerate data-parallel portions of application code [53]. The CSX600 processor is based on ClearSpeed’s MTAP or single instruction Multi-Threaded Array Processor, shown in Figure 11. This is a SIMD-like architecture that consists of two main components: a control unit (called the mono execution unit) and an array of PEs

(called the poly execution unit).

The two CSX600 co-processors on the board each contain 96 PEs for an overall total of 192 PEs. Every multi-threaded poly unit (PE) contains 6 KB of SRAM local memory, a superscalar 64-bit FPU, its own ALU, an integer MAC, a 128-byte register file, and I/O ports. The chips operate at 250 MHz, yielding a total of 33 GFLOPS of DGEMM performance with an average power dissipation of 10 watts.

Figure 11: ClearSpeed CSX processor organization. Diagram courtesy of ClearSpeed, http://www.clearspeed.com/products/csx700/.

Algorithms are written in an extended C language, called Cn. Close to C, Cn has one important extension: the parallel data type poly. This allows the built-in C types and arrays to be stored and manipulated in the local PE memory. The software development kit includes ClearSpeed's extended C compiler, assembler, and libraries, as well as a visual debugger. More details about the architecture are available from the company's website, as well as in [54].

As a SIMD-like platform, the CSX lacks the associative functions (maximum and associative search) utilized by SWAMP and SWAMP+ that ASC natively supports via the broadcast / reduction network in constant time [9]. Associative functionality can be handled at the software level with a small slowdown for emulation. These functions have been written and optimized for speed and efficiency in the ClearSpeed assembly language.

An additional relevant detail about ASC is that the PE interconnection network is not specifically defined. It can be as complex as an Omega or Flip network or a fat tree, or as simple as a linear array. The SWAMP+ suite of algorithms only requires a linear array to communicate with the northern neighboring PE for the north and northwest values that were computed previously. The ClearSpeed board has a linear network between PEs with wraparound. This is dubbed the swazzle network and is well suited to the needs of SWAMP and SWAMP+. The SWAMP+ algorithms also aim to increase the compute to I/O time ratio, making more use of the compute capabilities of the ClearSpeed board. This is useful for overall speedup, amortizing the cost of computation and communication.

To reiterate, the ClearSpeed board is used to emulate ASC to allow for the broader use of the SWAMP algorithms and the possibility of running other ASC algorithms on available hardware. The ClearSpeed hardware has been used for associative Air Traffic

Control (ATC) algorithms [30] [55], as well as for the SWAMP+ implementation, where our approach and results are presented in Chapter 7.

CHAPTER 7

SWAMP+ Implementation on ClearSpeed Hardware

An implementation of SWAMP was completed on the ClearSpeed CSX620 hardware using the Cn language. The code was then expanded to include SWAMP+ multiple-to-multiple comparisons.

7.1 Implementing Associative SWAMP+ on the ClearSpeed CSX

Because ASC is an extended SIMD, mapping ASC to the CSX processor is a relatively straightforward process. The CSX processor and accelerator board already have hardware to broadcast instructions and data to the PEs, enable and disable PEs, and detect whether any PEs are currently enabled (anyResponders). This fulfills many of the ASC model's requirements. However, the CSX processor does not have direct support for computing a global minimum/maximum or selecting a single PE from multiple responders.

The CSX processor does have the ability to reduce a parallel value to a scalar using logical AND or OR. With this capability it is possible to use Falkoff’s algorithm to implement minimum/maximum search. Falkoff’s algorithm [56] locates a maximum value by processing the values in bit-serial fashion, computing the logical OR of each parallel bitslice, eliminating from consideration those values whose bit does not match

the sum (i.e., the logical OR of that bitslice). The algorithm is easily adapted to compute a minimum by first inverting

all the value bits.
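For reference, a software emulation of Falkoff's bit-serial maximum using only an OR reduction might look like the C sketch below: it walks the bits from most to least significant, ORs the current bitslice over the still-active candidates, and drops any candidate whose bit disagrees. This is only an illustration, not the optimized assembly implementation described in Chapter 6.

    /* Falkoff-style maximum search over the active elements (illustrative only).
     * On return, active[] flags exactly the elements that hold the maximum.    */
    unsigned falkoff_max(const unsigned *val, int *active, int n, int bits)
    {
        unsigned result = 0;
        for (int b = bits - 1; b >= 0; b--) {
            unsigned slice_or = 0;                  /* OR reduction of this bitslice */
            for (int p = 0; p < n; p++)
                if (active[p]) slice_or |= (val[p] >> b) & 1u;
            if (slice_or) {                         /* some candidate has a 1 here */
                for (int p = 0; p < n; p++)         /* drop candidates whose bit is 0 */
                    if (active[p] && !((val[p] >> b) & 1u)) active[p] = 0;
                result |= 1u << b;
            }
        }
        return result;
    }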

The pickOne operation selects a single PE when there are multiple responders. It can be implemented on the CSX processor by using the minimum/maximum operators provided by Cn. Each PE has a unique index associated with it and searching for the

PE with the maximum or minimum index will select a single, active PE.
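Under the same emulation, pickOne reduces to a minimum (or maximum) search over the indices of the responding PEs; a trivial C sketch using the active[] flags from the previous example:

    /* Select a single responder by taking the lowest active index (illustrative only). */
    int pick_one(const int *active, int n)
    {
        for (int p = 0; p < n; p++)
            if (active[p]) return p;   /* equivalent to a minimum search over PE indices */
        return -1;                     /* no responders */
    }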

With the pickOne and the minimum/maximum search operators emulated in software, the CSX processor can be treated as an associative SIMD. In theory, any ASC algorithm, like SWAMP+, can be adapted to run on the ClearSpeed CSX architecture using the emulated associative functions. More information about these functions is available in Appendix listing B.3.

The associative-like functions used in the ClearSpeed code have a slightly different nomenclature:

• count – substitute for responder detection (anyResponders)

• get_short – a type-specific pickOne operation for short integers

• get_char – a type-specific pickOne operation for characters

• max_int – maximum search functionality for integers

In many ClearSpeed applications, there are two code bases, one that runs on the

host machine and is written in C (.c and .h file extensions), and one that runs on the CSX processor and is written in Cn (.cn file extension). To communicate between the host and the accelerator, an application programming interface or API library is used. The code for the SWAMP+ interface is listed in Appendix B.2 in the swampm2m.c file. The special functions are prefaced by CSAPI to indicate they are part of the ClearSpeed API. To pass data, two C structs have been set up in swamp.h.

They are explicitly passed between the host and the board using the CSAPI. The mono memory is accessible to both sides, so the parameter struct is written into mono memory and the result struct is read back from it.

The swampm2m.c program sets up the parameters for the Cn program, sets up the connection to the board, writes the parameter struct to mono memory on the board and calls the corresponding swamp.cn program. Once the C program initializes the

Cn code, it waits for the board to send a terminate signal before reading the results back from the mono memory.
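The actual structs are defined in swamp.h (Appendix B); the sketch below only illustrates the general shape of the exchange through mono memory, and every field name in it is hypothetical.

    /* Hypothetical shapes of the host/board exchange (field names invented for
     * illustration; see swamp.h in Appendix B for the real definitions).        */
    #define MAX_SEQ 96                  /* one character per PE on a CSX600 chip */

    struct swamp_params {               /* written by the host into mono memory */
        char  s1[MAX_SEQ + 1];
        char  s2[MAX_SEQ + 1];
        int   match, mismatch;          /* +5 / -4 for nucleotides              */
        int   gap_open, gap_extend;     /* -10 / -2                             */
        int   k;                        /* number of sub-alignments requested   */
        float delta;                    /* score-degradation cutoff             */
    };

    struct swamp_result {               /* read back after the terminate signal */
        int   score;
        char  aligned_s1[2 * MAX_SEQ];
        char  aligned_s2[2 * MAX_SEQ];
        unsigned long long calc_cycles, traceback_cycles;
    };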

7.2 Clearspeed Running Results

There are essentially two parts of the SWAMP+ code: the parallel computation of the matrix and the sequential traceback. The analysis first looks at the parallel matrix computation. This is often the only type of analysis that is completed for the parallel Smith-Waterman sequence alignment algorithms. The second half deals with the sequential traceback, reviewing the performance for the SWAMP+ extensions.

For a fairer performance comparison between SWAMP with one alignment and

SWAMP+ with multiple alignments, we run SWAMP+ and specify that only a single

alignment is desired. This is to compensate for minimal extra bookkeeping introduced

in SWAMP+.

7.2.1 Parallel Matrix Computation

The logic in swamp.cn is similar to the pseudocode outline presented in Section

5.4. It initializes the data using the concept adapted from the wavefront approach for

a SIMD memory layout. This is similar to the ASC implementation, except that the

entire database sequence is copied at a time instead of using the stack concept that

was necessary for optimization in ASC. This is possible due to the pointers available

in Cn, unlike the ASC language.

The computation of the three matrices for the north, west and northwestern val-

ues uses the poly execution units and memory on a single CSX chip. The logical

“diagonals” are processed, similar to the ASC implementation. Instead of being able

to access the parallel variables directly in ASC by using the notation for the current par-

allel location $ joined with an addition or subtraction operator followed by an index

[$±i], the data must be moved between poly units (PEs) across the swazzle network.

The swazzle functions are a bit tricky due to the fact that if something is swazzled

out of or into a non-active PE, the values will become garbage. This is true for the

swazzle_up function that we utilized.

For performance metrics, the number of cycles was counted using the get_cycles() function. Running at 250 MHz (250 million cycles per second), timings can be derived, as is done for the throughput CUPS measurement in Figure 14. The parameters used are those suggested by [57] for nucleotide alignments. The affine gap penalties are -10 to open a gap and -2 to extend. A match is worth +5 and a mismatch between bases is -4.
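As a worked example of how a cycle count becomes the CUPS figures reported below, assuming the 250 MHz clock and one cell update per element of the m x n matrix:

    /* Convert a cycle count into MCUPS at 250 MHz (illustrative only). */
    double mcups(unsigned long long cycles, int m, int n)
    {
        double seconds = (double)cycles / 250.0e6;   /* 250 million cycles per second */
        double updates = (double)m * (double)n;      /* cells computed                */
        return updates / seconds / 1.0e6;            /* millions of cell updates/s    */
    }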

Figure 12 shows the average number of cycles for computing the matrices. This is a parallel operation, and whether 10 characters or 96 characters are compared at a time, the overall cycle time is the same. This is the major advancement of the SIMD processing, showing that the theoretical optimal parallel speedup is achievable.

Error bars have been included on the first two plots to give the reader the extreme values since each data point is the arithmetic mean of thirty runs. In looking at the average lines and the y-axis error bars, one can see that there are eight outliers that skew the curves. These outliers are an order of magnitude larger than the rest of the cycle counts for the computation section. We believe that this is due to the nature of the test runs. Output was redirected into files that reside on a remote file server.

When we ran the tests with no file writing, these high numbers were not observed.

Eight times out of over 4,500 runs (or 1 in 562.5 alignments) one alignment would have a much larger cycle count. These were not easily or uniformly reproducible.

Figure 12: The average number of calculation cycles over 30 runs. This graph is broken down into each subalignment. There were eight outliers in over 4,500 runs, each an order of magnitude larger than the cycle counts for the rest of the runs. That is what pulled the calculation cycle count averages up, as seen in the graph. It does show that the number of parallel computation steps is roughly the same, regardless of sequence size. Lower is better.

To give a clearer perspective, the averages have been recomputed with these top eight outliers removed and are shown in Figure 13. The second highest cycle count is used for the y-error bars. These second highest cycle counts are the same order of magnitude as the remaining 28 runs, pointing out that there is some operating system effect that occasionally affects the board's cycle count behavior.

To use a more standard metric, the cell updates per second or CUPS measurement has been computed. Since the time to compute the matrix for two sequences of length ten or length 96 is roughly the same on the ClearSpeed board with 96 PEs as shown in Figure 14, the CUPS measurement increases (where higher is better) to the maximum aligned sequence lengths with 96 characters each. This is because the number of updates per second is greater as the length of the sequences grows 76 while the execution time holds. For aligning two strings of 96 characters, the highest update rate is 36.13 million cell updates per second or MCUPS. This is higher than the highest CUPS rate (23.87 MCUPS) reached using a single node for two sequences of length 160 discussed in Chapter 8.

Figure 14 shows that all of the CUPS rates are so close across the runs that they overlap completely in the graph. This performance measurement is often not a part of parallel sequence alignment algorithms. CUPS is a throughput metric, and the

SWAMP+ performance is not groundbreaking for two reasons. First, this algorithm was not designed with a goal of optimizing throughput. Second, the algorithms we would compare it against do no traceback at all, let alone multiple sub-alignments.

There are much different goals in the design and implementation. Therefore, the

CUPS measurement is not the most accurate metric for this work.

Some example CUPS numbers from other implementations are not directly comparable to this work for several reasons, including their use of scoring-matrix lookups (which we do not use) and an optimization called "lazy F evaluation," where the computations for the northern neighbors are skipped unless it is later determined that they may influence the final outcome. The numbers are taken from [24], with the runs referred to as "Wozniak" [19], "Rognes" [20] and "Farrar" [24], looking at the average CUPS numbers. In a case where the majority of northern neighbors had to be calculated, using the BLOSUM62 scoring matrix with a gap-opening penalty of 10 and a gap-extension penalty of 1, the average CUPS for Wozniak was 351 MCUPS,

Rognes reached 374 MCUPS and Farrar 1817 MCUPS. Both Rognes and Farrar include the lazy F evaluation. In a second case using the BLOSUM62 scoring matrix with the same penalties, more of the northern neighbors can be ignored; fewer computations are needed per cell, resulting in a higher CUPS rate. Wozniak (with no lazy

F evaluation) averaged 352 MCUPS, Rognes 816 MCUPS, and Farrar 2553 MCUPS, compared to our 36.13 MCUPS.

A full table presenting a more in-depth MCUPS comparison can be found in [58].

Figure 14: Cell Updates Per Second for Matrix Computation (CUPS), where higher is better.

7.2.2 Sequential Traceback

The second half of the code deals with actually producing the alignments, not just finding the terminal character of that alignment. This traceback step is often overlooked or ignored by other parallel implementations such as [24], [46], [51], [44],

[20], [47] and [19]. Our innovative approach is to use the power of the associative search as well as reduce the compute to I/O time for finding multiple, non-overlapping, non-intersecting subsequence alignments.

The nature of starting at the maximum computed value in the matrix of C values and backtracking from that point to the beginning of the subsequence alignment, including any insertions and deletions, is a sequential process. Therefore, the amount of time taken for each alignment depends on the actual length of the match. Figure 15 shows that the first alignment always takes the largest amount of time. This is because the initial alignment is the best possible alignment with a given set of parameters. The second through kth alignments are shorter and therefore require less time.

The overall time of the alignments, given in cycle counts, grows linearly with the size of the sequences themselves. These numbers confirm the expected performance of the ClearSpeed implementation that is based on our ASC algorithms.

To get a better sense of how the performance of the two sections of Smith-Waterman compares, they are combined and shown in Figure 16.

Figure 15: The average number of traceback cycles over 30 runs. The longest alignment is the first alignment, as expected. Therefore the first traceback in all runs with 1 to 5 alignments returned has a higher cycle count than any of the subsequent alignments.

Figure 16: Comparison of Cycle Counts for Computation and Traceback

7.3 Conclusions

We were able to show that the SWAMP and SWAMP+ algorithms can be successfully implemented, run and tested on hardware. The ClearSpeed hardware was able to provide up to a 96x parallel speedup for the matrix computation section of the algorithms while providing a fully implemented, parallel Smith-Waterman algorithm that was extended to include the additional sub-alignment results. The optimal parallel speedup possible was achieved, a fundamental goal of this research.

CHAPTER 8

Smith-Waterman on a Distributed Memory Cluster System

8.1 Introduction

Since data-intensive computing is pervasive in the bioinformatics field, the need for larger and more powerful computers is ever present. With the rice genome over 390 million and the human genome over 3.3 billion characters long, large data sets in sequence analysis are a fact of life.

A rigorous parallel approach generally fails due to the O(n^2) memory constraints of the Smith-Waterman sequence alignment algorithm.[1] We investigate the ability to use the Smith-Waterman sequence alignment algorithm with extremely large alignments, on the order of a quarter of a million characters and larger for both sequences. Single alignments of the proposed large scale using the exact Smith-Waterman algorithm have been infeasible due to the intensive memory and high computational costs of the algorithm. Another key feature of our approach is that it includes the traceback without later recomputation of the entire matrix. This traceback step is often overlooked or ignored by other parallel implementations such as [24], [46], [51], [44], [20], [47] and [19], but it would be infeasible in the problem-size domain we envision. Whereas other optimization techniques have focused on throughput and optimization for a single core or single accelerator (Cell processors and GPGPUs), we push the boundaries of what can be aligned with a fully-featured Smith-Waterman, including the traceback.

[1] Optimizations that use only linear memory exist [9], but since we wanted to push the memory requirements for this work, the simple O(m * n) or O(n^2) sized matrices are used.

For the problem sizes we consider large-scale, 250,000 base pairs and larger in each sequence with a full traceback, the memory requirements go far beyond what the local cache and local memory of a single node are able to handle. To avoid a drastic slowdown from paging to disk and memory segmentation faults, we propose the use of JumboMem [59].

In the previous chapter, we were able to achieve optimal speedup for the Clear-

Speed implementation. A drawback is that the hardware is a limiting factor on the data sizes that can be run. The number of characters and values that fit within a single PE is limited by its 6 KB of RAM. With a width of m + n for the character array and the number of data values for D, I and C to store, the S2 string is limited to 566 characters with the current variables used. The other primary limitation is the number of PEs. If S1 is larger than 96, the number of PEs on a chip, one approach is to "double up" on the work that a single PE handles.

This would allow up to 192 characters in S1. At the same time, it cuts the memory per PE available for the S2 values and computations in half, while increasing the complexity of the code with extra bookkeeping, since there is no PE virtualization as was available on other parallel platforms such as the Wavetracer and Zephyr machines.

Using a cluster of computers, we have performed extremely large pairwise alignments, larger than possible in a single machine's main memory. The largest alignment we ran was roughly 330,000 by 330,000 characters, resulting in a completely in-memory matrix size of 107,374,182,400 elements. The initial results show good scaling and a promising scalable performance as larger sequences are aligned.

The chapter reviews JumboMem, a program that enables unmodified sequential programs to access all of the memory in a cluster as though it were on a local machine.

We present the results of using the Smith-Waterman algorithm with JumboMem and introduce a discussion of future work on a hierarchical parallel Smith-Waterman approach that incorporates JumboMem along with Intel's SSE intrinsics and POSIX threads. A brief description of the MIMD parallel model is available for review in

Section 3.1.1.

8.2 JumboMem

JumboMem [59] allows an entire MIMD cluster's memory to look like local memory with no additional hardware, no recompilation, and no root access. This means that clusters and existing programs can be used at a larger scale with no additional development time or hassle.

The use of JumboMem extends to many large-scale data sets and programs that need verification. Using a rapid prototyping approach, a script can be used across a cluster without explicit parallelization. Combined with existing programs, it can be remarkably useful for validating and verifying algorithm results on large data sets.

The motivation is to overcome the memory constraints of a fully working sequence alignment algorithm that includes the traceback for extreme-scale sequence sizes, as well as to avoid the time and effort needed to parallelize program code. Parallelizing code can and does act as a barrier to using high-performance parallel computing.

Researchers who do not have programmer support, or who already use executable code that is not designed for clusters, can now run on a cluster using JumboMem without explicit parallelization. JumboMem is a tool that increases the feasible problem size and encourages rapid and simplified verification of bioinformatics software.

JumboMem software gives a program access to memory spread across multiple computers in a cluster, providing the illusion that all of the memory resides within a single computer. When a program exceeds the memory in one computer, it automatically spills over into the memory of the next computer. This takes advantage of the entire memory of the cluster, not just within a single node. A simplified example of this is shown in Figure 17.

JumboMem is a user-level alternative memory server. This is ideal when a user does NOT have administrative access to a cluster, needs to analyze large volumes of data without specifically parallelizing the code, or does not even have access to the program code (i.e., only an executable is available). In rapid prototyping and quick validation of results, improving or parallelizing low-use scripts is not feasible. For all of those cases, the JumboMem tool can be invaluable.

Figure 17: Across multiple nodes' main memory, JumboMem allows an entire cluster's memory to look like local memory with no additional hardware, no recompilation, and no root account access.

One note is that JumboMem does not support programs that use the fork() command. A full description of JumboMem is outlined in [59]. The software and supporting documentation are available for download at http://jumbomem.sf.net/.

To demonstrate how powerful this model is, we have used the Smith-Waterman sequence alignment algorithm with JumboMem to align extreme-scale sequences.

8.3 Extreme-Scale Alignments on Clusters

Our approach facilitates the alignment of very large data sets via rapid prototyping, allowing the use of a cluster without explicit reprogramming for that cluster. We have performed pairwise alignments on a cluster of computers far larger than is possible on a single machine. The initial results show good scaling and promising performance as even larger sequences are aligned.

Table 1: PAL Cluster Characteristics

Category     Item               Value
CPU          Type               AMD Opteron 270
             Cores              2
             Clock rate         2 GHz
Node         CPU sockets        2
             Count              256
             Motherboard        Tyan Thunder K8SRE (S2891)
             BIOS               LinuxBIOS
Memory       Capacity/node      4 GB
             Type               DDR400 (PC3200)
Local disk   Capacity           120 GB
             Type               Western Digital Caviar 120GB RE (WD1200SD)
             Cache size         8 MB
Network      Type               InfiniBand
             Interface          Mellanox Infinihost III Ex (25218) HCAs with MemFree firmware v5.2.0
             Switch             Voltaire ISR9288 288-port
Software     Operating system   Linux 2.6.18
             OS distribution    Debian 4.0 (Etch)
             Messaging layer    Open MPI 1.2
             Job launch         Slurm

8.3.1 Experiments

A cluster of dual-core AMD Opteron nodes has been used as the development platform. The details of the cluster are listed in Table 1.

A simple sequential implementation of the Smith-Waterman algorithm was written in C, in Python, and in Python using the NumPy library. We found that the C code outperforms the Python code in execution time, although the use of arrays through the NumPy library did improve the execution speed of the Python code considerably. Because the C version outperforms the Python versions, it is the focus of the results discussion.

The C code uses malloc to allocate a block of memory for the arrays at the start of the program, after the sizes of the two strings are read in from a file. The sequential code fills the dynamic programming matrix to record the scores and outputs the maximum value. A second generation of testing used affine gap penalties with the full traceback, returning the aligned, gapped subsequences.
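To make that structure concrete, the following is a minimal sketch of such a sequential implementation (illustrative only; it is not the exact code used in these experiments). It allocates the score matrix as a single contiguous block up front, which is also the allocation pattern that JumboMem distributes most efficiently, fills it with a basic linear-gap Smith-Waterman recurrence, and reports the maximum score. The affine-gap scoring and full traceback used in the second generation of testing are omitted for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MATCH     2
#define MISMATCH  1   /* subtracted on a mismatch */
#define GAP       1   /* linear gap penalty for this sketch */

static int max4(int a, int b, int c, int d) {
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

/* Fill an (m+1) x (n+1) Smith-Waterman matrix H and return the best score. */
static int sw_fill(const char *s1, const char *s2, int *H, long m, long n)
{
    int best = 0;
    memset(H, 0, (size_t)(m + 1) * (n + 1) * sizeof(int));  /* row 0 and column 0 are zero */

    for (long i = 1; i <= m; i++) {
        for (long j = 1; j <= n; j++) {
            int sub = (s1[i - 1] == s2[j - 1]) ? MATCH : -MISMATCH;
            int h = max4(0,
                         H[(i - 1) * (n + 1) + (j - 1)] + sub,  /* from the northwest */
                         H[(i - 1) * (n + 1) + j] - GAP,        /* from the north     */
                         H[i * (n + 1) + (j - 1)] - GAP);       /* from the west      */
            H[i * (n + 1) + j] = h;
            if (h > best) best = h;
        }
    }
    return best;
}

int main(void)
{
    const char *s1 = "ACACACTA", *s2 = "AGCACACA";   /* stand-ins for the file input */
    long m = (long)strlen(s1), n = (long)strlen(s2);

    /* One large allocation for the whole matrix, as in the experiments. */
    int *H = malloc((size_t)(m + 1) * (n + 1) * sizeof(int));
    if (!H) return 1;

    printf("max score = %d\n", sw_fill(s1, s2, H, m, n));
    free(H);
    return 0;
}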

Again, this code is not written for a cluster. It is sequential C code, designed for a single machine. To run this code using the cluster’s memory, we use JumboMem.

We invoke JumboMem, specifying the number of processor nodes to use, followed by the call to the program and any parameters that the program requires.

An example call is:

jumbomem -np 27 ./sw 163840.query.txt 163840.db.txt

This will run using 27 cluster nodes: the node where the code actually executes plus 26 memory “servers” for the two 163,840-character query and database strings.

The second part of the call, ./sw 163840.query.txt 163840.db.txt, is the call to the Smith-Waterman executable with the normal parameters for the sw program. The sequential program’s parameters remain unchanged when using JumboMem.
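As a rough illustration of why a node count of this order is needed (assuming, purely for illustration, 4-byte matrix entries; the actual per-cell storage may differ), the in-memory matrix for two 163,840-character strings occupies

    163,840 × 163,840 cells × 4 bytes per cell = 107,374,182,400 bytes ≈ 100 GiB.

With roughly 4 GB of memory per node (Table 1), on the order of 25 to 27 nodes are required once operating-system and JumboMem overhead on each node are taken into account.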

8.3.2 Results

Due to the way JumboMem works, one large memory allocation made at a single point in the program, rather than a series of small allocations, allows JumboMem to detect and “distribute” the values to other nodes’ main memory more efficiently.

Figure 18: The cell updates per second (CUPS) rate does experience some performance degradation, but not as much as it would if the program had to page to disk.

For our runs, the total number of nodes used for the out-of-node memory ranged from 2 to 106, since not all of the nodes in the cluster were available for use. As shown in Figure 18, there is a slight drop in the cell updates per second (CUPS) throughput metric once other nodes’ memory starts being used. The drop in CUPS performance is less dramatic than it would be if the individual node had to page the Smith-Waterman matrices’ values to the hard drive instead of passing them off to other nodes’ memory via the network. Using JumboMem shows a performance improvement and enables larger runs using multiple nodes. In our case, we had segmentation faults when attempting to run the larger data sizes on a single node.
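For reference, the CUPS metric used here follows the standard definition: for a query of length m and a database sequence of length n aligned in total time t,

    CUPS = (m × n) / t,

so a roughly constant CUPS rate as the matrix quadruples in size corresponds to an execution time that also roughly quadruples.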

There is no upper limit to the memory size that JumboMem can use. The only limit is the available memory on the given cluster and the number of nodes within that cluster on which it is run. The largest Smith-Waterman sequence alignment we ran was with two strings of 327,680 (roughly 330,000) characters each, resulting in a matrix of 327,680 × 327,680 = 107,374,182,400 elements. Over half a terabyte of memory was used to run this last instance of the Smith-Waterman algorithm on the PAL cluster. We believe this to be one of the largest instances of the algorithm ever run, especially with no optimizations such as linear-memory matrix storage.
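The half-terabyte figure is consistent with a simple estimate, assuming (for illustration only; the actual per-cell storage may differ) a 4-byte score plus a 1-byte traceback entry per matrix cell:

    107,374,182,400 cells × 5 bytes per cell = 536,870,912,000 bytes ≈ 500 GiB.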

The execution times for the C code are shown in Figure 19. As the memory requirements grow beyond the size of one node, JumboMem is used. The execution times do not noticeably increase once JumboMem is in use, whereas they would increase far more with disk paging. JumboMem therefore helps to keep execution time down relative to paging, while allowing larger problem instances to be run that would otherwise fail with a segmentation fault due to insufficient memory, as we experienced.

Figure 19: The execution time grows consistently even as JumboMem begins to use other nodes’ memory. Note the logarithmic scales: as the input string size doubles, the calculations and memory requirements quadruple.

Unlike many other parallel implementations of Smith-Waterman, this version provides the full alignment via the traceback section of the algorithm. Not only does it execute the traceback, it is designed to provide the full alignment between two sequences of extreme scale.

The other advantage is that JumboMem allows an entire cluster’s memory to look like local memory with no additional hardware, no recompilation, and no root access. This means that clusters and existing programs can be used at a larger scale with no additional development time.

This can be an invaluable tool for validating many large-scale programs, such as sequence assembly algorithms, as well as for performing non-heuristic, in-depth pairwise studies between two sequences. A script or existing program can be used on a cluster with no additional development. This is a powerful capability in itself, and combined with existing programs it can be remarkably useful.

8.4 Conclusion

Using JumboMem on a cluster of computers, we were able to align extremely large sequences using the exact Smith-Waterman approach. We performed a full Smith-Waterman sequence alignment with two strings, each approximately 330,000 characters long, with a matrix containing 107,374,182,400 elements. We believe this to be one of the largest instances of the algorithm run while held completely in memory.

The combination of existing techniques and technology to enable work with massive data sets is exciting and vital. JumboMem allows an entire cluster’s memory to look like local memory with no additional hardware, no recompilation, and no root access. Existing non-parallel programs and rapidly developed scripts, in combination with JumboMem on a cluster, can enable program usage on a scale that was previously impossible. It can also serve as a platform for verification and validation of many algorithms with large data sets in the bioinformatics domain, including sequence assembly algorithms such as Velvet [60], SSAKE [61], and Euler [62], as well as alignment and polymorphism detection applications such as BFAST [63] and Bowtie [64]. This means that clusters and existing programs can be used at an extreme scale with no additional development time.

CHAPTER 9

Ongoing and Future Work

This chapter introduces ongoing work on a hierarchical parallelization for extreme-scale Smith-Waterman sequence alignment that uses Intel’s Streaming SIMD Extensions (SSE2), POSIX threads, and JumboMem in a “wavefront of wavefronts” approach to speed up and extend the alignment capabilities, growing out of the initial work presented in Chapter 8.

9.1 Hierarchical Parallelism for Smith-Waterman Incorporating JumboMem

The earlier chapter presented easy, node-level parallelism through the use of JumboMem. This is a powerful tool that allows many programs and scripts to be used on data sets of huge size. While useful, the benefit may be incremental compared to fully parallelized code.

What follows is a discussion of current and future work whose goal is a scalable Smith-Waterman solution that keeps pace with increasing core counts and handles very large problem sizes. We want to be able to process full genome-length alignments quickly and accurately, including the traceback that returns the actual alignment. Our approach is to parallelize at multiple levels: within a core, between multiple cores, and then between multiple nodes.


9.1.1 Within a Single Core

The first level of parallelization is within a single core. The dynamic programming matrix creates dependencies that limit the level of achievable parallelism, but using a wavefront approach can still lead to speedup.

The SSE intrinsics work is the first level of the multiple-level parallelism for extreme-scale Smith-Waterman alignments. In a multiple-core system, each core uses a wavefront approach similar to [19] to align its subset of the database sequence (S2). This takes advantage of the data independence along the minor diagonal.
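The sketch below illustrates the single-core wavefront idea in plain C (the names, scoring, and matrix layout are illustrative assumptions, not the implementation itself). The matrix is walked anti-diagonal by anti-diagonal; every cell on one minor diagonal depends only on the two previous diagonals, so the inner loop touches mutually independent cells and is the natural unit to vectorize with SSE intrinsics, for example eight 16-bit scores per 128-bit register.

/* Anti-diagonal (wavefront) traversal of an m x n Smith-Waterman matrix.
 * H is assumed to be an (m+1) x (n+1) array with row 0 and column 0 zeroed.
 * Cells on the same anti-diagonal d = i + j are mutually independent, so the
 * inner loop is the natural unit for SSE vectorization within one core. */
void sw_wavefront(const char *s1, const char *s2,
                  int *H, long m, long n,
                  int match, int mismatch, int gap)
{
    for (long d = 2; d <= m + n; d++) {            /* d = i + j, with 1-based i and j */
        long i_lo = (d - n > 1) ? d - n : 1;
        long i_hi = (d - 1 < m) ? d - 1 : m;
        for (long i = i_lo; i <= i_hi; i++) {      /* independent cells: vectorizable */
            long j = d - i;
            int sub  = (s1[i - 1] == s2[j - 1]) ? match : -mismatch;
            int h    = H[(i - 1) * (n + 1) + (j - 1)] + sub;  /* northwest */
            int up   = H[(i - 1) * (n + 1) + j] - gap;        /* north     */
            int left = H[i * (n + 1) + (j - 1)] - gap;        /* west      */
            if (up > h)   h = up;
            if (left > h) h = left;
            if (h < 0)    h = 0;
            H[i * (n + 1) + j] = h;
        }
    }
}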

9.1.2 Across Cores and Nodes

It is possible to extend the SSE wavefront approach across multiple cores. Within a single core, the SSE wavefront approach is used; a second level of parallelism then uses Pthreads to distribute the work and collect the sub-alignments across the multiple cores.

The approach is termed a “wavefront of wavefronts” and is abstractly represented in Figure 20. The first core (Core 0) computes and stores its values in a parallel wavefront. Once the first core completes its first block of the query sequence, the data on the boundary is exchanged with Core 1 via the shared cache, so Core 1 has the data it needs to begin its own computation. Concurrently, Core 0 continues with its second block, computing the dynamic programming matrix for its subset of the sequence alignment. To share and synchronize data, POSIX Threads (Pthreads) are used between the cores.

Figure 20: A wavefront of wavefronts approach, merging a hierarchy of parallelism, first within a single core, and then across multiple cores.

As shown in Figure 20, the cores are represented as columns, and every “block” represents a partial piece of the overall matrix computed in a given time step. Following the pattern, blocks across the different cores are computed in parallel (concurrently) along the larger, cross-core wavefront, or minor diagonal; this is where the term “wavefront of wavefronts” originates. It is of interest to study the scalability with respect to both sequence size and the growing number of available cores in this developmental system.
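A skeleton of the cross-core synchronization is sketched below (all names, the core count, and the block count are illustrative assumptions, and the per-block SSE wavefront computation is elided as compute_block). Each core owns a vertical stripe of the matrix; a per-core progress counter guarded by a mutex and condition variable lets core c+1 start block k only after core c has finished block k and its boundary data is available, producing the staggered pattern of Figure 20.

#include <pthread.h>

#define NCORES   4          /* illustrative core count  */
#define NBLOCKS  16         /* row blocks per stripe    */

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int done[NCORES];    /* done[c] = number of blocks core c has finished */

/* Placeholder for the per-block SSE wavefront fill; the real version would
 * read the boundary column produced by core c-1 for block k. */
static void compute_block(int core, int block) { (void)core; (void)block; }

static void *worker(void *arg)
{
    int core = (int)(long)arg;

    for (int k = 0; k < NBLOCKS; k++) {
        if (core > 0) {                        /* wait for the west neighbor's block k */
            pthread_mutex_lock(&lock);
            while (done[core - 1] <= k)
                pthread_cond_wait(&ready, &lock);
            pthread_mutex_unlock(&lock);
        }

        compute_block(core, k);                /* SSE wavefront within this block */

        pthread_mutex_lock(&lock);             /* publish progress for the east neighbor */
        done[core] = k + 1;
        pthread_cond_broadcast(&ready);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NCORES];
    for (long c = 0; c < NCORES; c++)
        pthread_create(&t[c], NULL, worker, (void *)c);
    for (int c = 0; c < NCORES; c++)
        pthread_join(t[c], NULL);
    return 0;
}

Real code would also pass the boundary column itself along with the counter update; only the ordering constraint is shown here.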

Proposed extensions include using the striped access method from [24], termed “lazy F evaluation,” for the north neighbor, as well as using linear-space matrices with O(n) space requirements instead of the full O(n²) matrix, such as those presented in [9] and referenced in [58]. This is also highly relevant to SWAMP+ in ASC and on ClearSpeed.

Both ParAlign [22] and this work use SSE and Pthreads, but the first level of parallelism differs. At the SSE level, [22] does not use a wavefront approach. Another significant difference is that ParAlign uses cluster parallelism to handle multiple, different query sequences, not parts of a single large sequence as the wavefront-of-wavefronts approach does.

Multiple-level parallelism with a “wavefront of wavefronts” approach is thus a feasible design for faster, Smith-Waterman-quality, extreme-scale sequence alignments using multiple cores and multiple nodes.

This work is related to other wavefront algorithms, such as Sweep3D [65], a radiation particle transport application that exhibits data dependencies similar to those of Smith-Waterman. Once completed, this work will be valuable in its own right and may also be applicable to particle physics modeling.

9.2 Continuing SWAMP+ Work

As stated at the end of Chapter 5, there are two aspects of continuing and future work. The first is to combine the Multiple-to-Multiple (m2m) SWAMP+ extension with the Single-to-Multiple (s2m) extension. This would enable an in-depth study of repeating regions, looking for multiple sub-alignments from each non-overlapping, non-intersecting region of interest; that is, first locate the sections of interest and then check whether they do in fact repeat.

The second item for future work is to evaluate another hardware platform on which to implement SWAMP+. NVIDIA’s Fermi architecture appears to have similarities to ClearSpeed’s MTAP architecture. The success of the ClearSpeed implementation of ASC algorithms, including our SWAMP+ work, encourages us to explore the associative functions and the adaptation of SWAMP+ for wider availability.

CHAPTER 10

Conclusions

The ASC model is a powerful paradigm for algorithm development. With low overhead, ASC can be emulated on multiple parallel hardware platforms. These strengths, combined with its tabular nature, led to the development of the associative version of the dynamic programming Smith-Waterman sequence alignment algorithm known as SWAMP.

Contributions include the ground-up design and implementation of SWAMP using the ASC model, programming language, and emulator. From this work, we created the SWAMP+ suite of algorithms to discover non-overlapping, non-intersecting sub-alignments for ASC with three options: single-to-multiple, multiple-to-single, and multiple-to-multiple sub-alignment discovery. The initial idea was to reuse the data and computations in conjunction with associative searching for finding the sub-alignments. While the later design of SWAMP+ requires recalculation of the matrices, it still takes advantage of the massive parallelism and fast searching with responder detection and selection features.

Since ASC is a model and does not exist as fully featured hardware, possible current parallel hardware platforms for ASC emulation were surveyed. After choosing the ClearSpeed CSX600 chip and accelerator board as the best platform for emulating ASC, we implemented both SWAMP and SWAMP+ as a proof of concept using ClearSpeed’s Cn programming language. The result is an optimal speedup of up to 96 times for the parallelized matrix computations using 96 PEs. SWAMP+ provides a full parallel Smith-Waterman algorithm, extended to return the additional non-overlapping, non-intersecting sub-alignment results in three different flavors.

To address the challenge of data- and memory-intensive computing that is so pervasive in the bioinformatics field, an innovative use of clusters was explored. To overcome the memory constraints of a fully working, highest-quality sequence alignment with traceback at extreme-scale sequence sizes, the memory of a cluster of computers was made to look like a single large virtual memory. The tool used is called JumboMem. It transparently utilized the memory of multiple cluster nodes to allow extremely large sequence alignments. We believe our tests to be among the largest non-optimized runs of Smith-Waterman ever performed entirely in memory.

Overall, this work developed new tools shown to work for bioinformatics. These massively parallel approaches for sequence alignment have the potential to be applied in other fields, including particle physics and text searching. It is my desire to continue to improve, extend, and implement useful approaches that further the scientific discovery process.

BIBLIOGRAPHY

[1] F. Guinand, “Parallelism for computational molecular biology,” in ISThmus 2000 Conference on Research and Development for the Information Society, Poznan, Poland, 2000. [2] L. D’Antonio, “Incorporating bioinformatics in an algorithms course,” in Proceedings of the 8th annual conference on Innovation and Technology in Computer Science Education, vol. 35 (3), 2003, pp. 211–214. [3] H. B. J. Nicholas, D. W. D. II, and A. J. Ropelewski. (Revised 1998) Sequence analysis tutorials: A tutorial on searching sequence databases and sequence scoring methods. [Online]. Available: http://www.nrbsc.org/old/education/tutorials/sequence/db/index.html [4] M. S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes. Boca Raton, FL: Chapman and Hall/CRC Press, 1995. [5] X. Huang, Chapter 3: Bio-Sequence Comparison and Alignment, ser. Current Topics in Computational Molecular Biology. Cambridge, MA: The MIT Press, 2002. [6] S. Needleman and C. Wunsch, “A general method applicable to the search for similarities in the amino acid sequences of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970. [7] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981. [8] O. Gotoh, “An improved algorithm for matching biological sequences,” Journal of Molecular Biology, vol. 162, no. 3, pp. 705–708, 1982. [9] X. Huang and W. Miller, “A time-efficient linear-space local similarity algorithm,” Adv. Appl. Math., vol. 12, no. 3, pp. 337–357, 1991. [10] M. Cameron and H. Williams, “Comparing compressed sequences for faster nucleotide searches,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 349–364, 2007. [11] J. D. Frey, “The use of the smith-waterman algorithm in melodic song identification,” Master’s Thesis, Kent State University, 2008.


[12] S. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403– 410, 1990. [13] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped blast and psi-blast: a new generation of protein database search programs,” Nucleic acids research, vol. 25, no. 17, pp. 3389–3402, 1997. [14] W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence com- parison,” Proceedings of the National Academy of Sciences of the United States of America, vol. 85, no. 8, pp. 2444–2448, 1988. [15] M. Craven. (2004) Lecture 4: Heuristic methods for searching. [Online]. Available: http://www.biostat.wisc.edu/bmi576/lecture4.pdf [16] P. H. Sellers, “On the theory and computation of evolutionary distances,” SIAM Journal of Applied Mathematics, vol. 26, no. 4, pp. 787–793, 1974. [17] D. S. Hirschberg, “A linear space algorithm for computing maximal common subsequences,” Communications of the ACM, vol. 18, no. 6, pp. 341–343, 1975, 360861. [18] (2000) Substitution matrices. [Online]. Available: http://www.ncbi.nlm.nih. gov/Education/BLASTinfo/Scoring2.html [19] A. Wozniak, “Using video-oriented instructions to speed up sequence compari- son,” Computer Applications in the Biosciences (CABIOS), vol. 13, no. 2, pp. 145 – 150, 1997. [20] T. Rognes and E. Seeberg, “Six-fold speed-up of smith-waterman sequence database searches using parallel processing on common microprocessors,” Bioin- formatics (Oxford, England), vol. 16, no. 8, pp. 699–706, 2000. [21] T. Rognes, “Paralign: a parallel sequence alignment algorithm for rapid and sensitive database searches,” Nucleic acids research, vol. 29, no. 7, pp. 1647–52, 2001. [22] P. E. Saebo, S. M. Andersen, J. Myrseth, J. K. Laerdahl, and T. Rognes, “Par- align: rapid and sensitive sequence similarity searches powered by parallel com- puting technology,” Nucleic acids research, vol. 33, no. suppl 2, pp. W535–539, 2005. [23] P. Green. (1993) Swat. [Online]. Available: http:\\www.phrap.org\phredphrap\ swat.html 103

[24] M. Farrar, “Striped smith-waterman speeds database searches six times over other simd implementations,” Bioinformatics (Oxford, England), vol. 23, no. 2, pp. 156–161, 2007. [25] J. Potter, J. W. Baker, S. Scott, A. Bansal, C. Leangsuksun, and C. Asthagiri, “Asc: an associative-computing paradigm,” Computer, vol. 27, no. 11, pp. 19–25, 1994. [26] M. J. Quinn, Parallel Computing: Theory and Practice, 2nd ed. New York: McGraw-Hill, 1994. [27] J. Baker. (2004) Simd and masc: Course notes from cs 6/73301: Parallel and distributed computing - power point slides. [Online]. Available: http://www.cs.kent.edu/∼wchantam/PDC Fall04/SIMD MASC.ppt [28] J. L. Potter, Associative Computing: A Programming Paradigm for Massively Parallel Computers. Plenum Publishing, 1992, book, Whole. [29] M. Jin, J. Baker, and K. Batcher, “Timings for associative operations on the masc model,” in 15th International Parallel and Distributed Processing Symposium (IPDPS’01) Workshops, San Francisco, A, 2001, p. 193. [30] M. Yuan, J. Baker, F. Drews, and W. Meilander, “An efficient implementation of air traffic control using the clearspeed csx620 system,” in Parallel and Distributed Computing Systems (PDCS), Cambridge, MA, 2009. [31] J. Trahan, M. Jin, W. Chantamas, and J. Baker, “Relating the power of the multiple associative computing model (masc) to that of reconfigurable bus-based models,” Journal of Parallel and Distributed Computing (JPDC), 2009. [32] R. Singh, D. Hoffman, S. Tell, and C. White, “Bioscan: a network sharable com- putational resource for searching biosequence databases,” Computer Applications in the Biosciences (CABIOS), vol. 12, no. 3, pp. 191–196, 1996. [33] A. Di Blas, D. M. Dahle, M. Diekhans, L. Grate, J. Hirschberg, K. Karplus, H. Keller, M. Kendrick, F. J. Mesa-Martinez, D. Pease, E. Rice, A. Schultz, D. Speck, and R. Hughey, “The ucsc kestrel parallel processor,” IEEE Transac- tions on Parallel and Distributed Systems, vol. 16, no. 1, pp. 80–92, 2005. [34] F. Zhang, X.-Z. Qiao, and Z.-Y. Liu, “A parallel smith-waterman algorithm based on divide and conquer,” in Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002. Proceedings., 2002, pp. 162–169. [35] V. Strumpen, “Coupling hundreds of workstations for parallel molecular sequence analysis,” Software Practice and Experience, vol. 25, no. 3, pp. 291–304, 1995. 104

[36] B. Schmidt, H. Schrder, and M. Schimmler, “Massively parallel solutions for molecular sequence analysis,” in First International Workshop on High Perfor- mance Computational Biology, Parallel and Distributed Processing Symposium, International, Fort Lauderdale, FL, 2002. [37] S. I. Steinfadt, M. Scherger, and J. W. Baker, “A local sequence alignment algorithm using an associative model of parallel computation,” in IASTED’s Computational and Systems Biology (CASB 2006), Dallas, TX, 2006, pp. 38–43. [38] M. Esenwein and J. W. Baker, “Vldc string matching for associative computing and multiple broadcast mesh,” in IASTED International Conference on Parallel and Distributed Computing and Systems, 1997, pp. 69–74. [39] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, “A greedy algorithm for aligning sequences,” Journal of Computational Biology, vol. 7, no. 1-2, pp. 203–214, 2000. [40] B. Ma, T. J., and L. M., “Patternhunter: Faster and more sensitive homology search,” Bioinformatics, vol. 18, no. 3, pp. 440–445, 2002. [41] S. Steinfadt and J. W. Baker, “Swamp: Smith-waterman using associative mas- sive parallelism,” in IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1–8. [42] W. Liu, B. Schmidt, G. Voss, and W. Muller-Wittig, “Streaming algorithms for biological sequence alignment on gpus,” IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 9, pp. 1270 – 1281, 2007. [43] Y. Liu, D. Maskell, and B. Schmidt, “Cudasw++: optimizing smith-waterman sequence database searches for cuda-enabled graphics processing units,” BMC Research Notes, vol. 2, no. 1, p. 73, 2009. [44] S. A. Manavski and G. Valle, “Cuda compatible gpu cards as efficient hardware accelerators for smith-waterman sequence alignment,” BMC Bioinformatics, vol. 9 Suppl 2, p. S10, 2008. [45] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with cuda,” Queue, vol. 6, no. 2, pp. 40–53, 2008. [46] M. Farrar. (2008) Optimizing smith-waterman for the cell broadband engine. [Online]. Available: http://sites.google.com/site/farrarmichael/SW-CellBE.pdf [47] A. Szalkowski, C. Ledergerber, P. Krahenbuhl, and C. Dessimoz, “Swps3 - fast multi-threaded vectorized smith-waterman for ibm cell/b.e. and x86/sse2,” BMC Research Notes, vol. 1, no. 1, p. 107, 2008. 105

[48] S. Steinfadt and K. Schaffer, “Parallel approaches for swamp sequence align- ment,” in Ohio Collaborative Conference for Bioinformatics (OCCBIO), Case Western Reserve University, Cleveland, OH, 2009. [49] K. J. Barker, K. Davis, A. Hoisie, D. J. Kerbyson, M. Lang, S. Pakin, and J. C. Sancho, “Entering the petaflop era: the architecture and performance of road- runner,” in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC), vol. Austin, Texas. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–11. [50] T. Oliver, B. Schmidt, and D. Maskell, “Hyper customized processors for bio- sequence database scanning on fpgas,” in FPGA ’05: Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, vol. Monterey, California, USA. New York, NY, USA: ACM, 2005, pp. 229–237. [51] I. T. Li, W. Shum, and K. Truong, “160-fold acceleration of the smith-waterman algorithm using a field programmable gate array (fpga),” BMC Bioinformatics, vol. 8, p. 185, 2007. [52] M. Fatica, “High performance computing with cuda: Introduction,” in Tutorial Slides presented at ACM/IEEE Conf. Supercomputing (SC), Austin, Texas, 2008. [53] Clearspeed. (February 2009) Products overview. [Online]. Available: http: //www.clearspeed.com/products/overview. [54] ——. (September 2008) Clearspeed technology csx600 runtime software user guide. [Online]. Available: http://www.clearspeed.com/docs/resources/ [55] M. Yuan, J. Baker, W. Meilander, L. Neiman, and F. Drews, “An efficient asso- ciative processor solution to an air traffic control problem,” in 24th IEEE Inter- national Parallel and Distributed Processing Symposium (IPDPS 2010), Atlanta, Georgia, 2010. [56] A. D. Falkoff, “Algorithms for parallel-search memories,” Journal of the ACM (JACM), vol. 9, no. 4, pp. 488–511, 1962. [57] (2007) Fasta search results tutorial. [Online]. Available: http://www.ebi.ac.uk/ 2can/tutorials/nucleotide/fasta1.html [58] R. Hughey, “Parallel sequence comparison and alignment,” Computer Applica- tions in the Biosciences (CABIOS), no. 12, pp. 473–479, 1996. [59] S. Pakin and G. Johnson, “Performance analysis of a user-level memory server,” in Proceedings of the 2007 IEEE International Conference on Cluster Computing. IEEE Computer Society, 2007, 1545153 249-258. [60] D. Zerbino and E. Birney, “Velvet: algorithms for de novo short read assembly using de bruijn graphs,” Genome Research, vol. 18, no. 5, pp. 821–829, 2008. 106

[61] R. L. Warren, G. G. Sutton, S. J. M. Jones, and R. A. Holt, “Assembling millions of short dna sequences using ssake,” Bioinformatics, vol. 23, no. 4, pp. 500–501, 2007, 10.1093/bioinformatics/btl629. [62] Z. Mulyukov and P. Pevzner, “Euler-pcr: finishing experiments for repeat resolution,” in Pacific Symposium on Biocomputing (PSB), Hawaii, 2002, pp. 199–210. [63] N. Homer, B. Merriman, and S. Nelson, “Local alignment of two-base encoded dna sequence,” BMC Bioinformatics, vol. 10, no. 1, 2009. [64] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg, “Ultrafast and memory-efficient alignment of short dna sequences to the human genome,” Genome Biology, vol. 10, no. 3, p. R25, 2009. [65] H. Wasserman. (1999) Asci sweep3d information page. [Online]. Available: http://www.ccs3.lanl.gov/pal/software/sweep3d/sweep3d_readme.html

APPENDIX A

ASC Source Code for SWAMP

A.1 ASC Code for SWAMP

The associative ASC code consists of multiple files, one for each function that is defined, according to the ASC emulator requirements. Each subsection here consists of a single ASC file that is linked into the first program in Listing A.1.

Listing A.1: Associative ASC Code for SWAMP Local Alignment Algorithm (swamp.asc)

1 §/ ∗ ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ ¤ 2 ∗ SWAMP.ASC

3 ∗ Same mem. usage with m+n+1 width needed

4 ∗ Shannon I. Steinfadt

5 ∗ December 3, 2007

6 ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ ∗ /

7

8 / ∗ ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗

9 SWAMP S h i f t

10 ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗

11 @@@CTTG

12 CC − @CTTG

13 AT − − @CTTG

14 TT − − − @CTTG

15 TG − − − − @CTTG

16 G − − − − − − @CTTG

17 ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ ∗ /

18 / ∗ $Log: swamp.asc ,v $ ∗ /

19

20 main swamp

21 #INCLUDE swamp vars . h

22

23 associate id[$], s1[$], temps2[$], parameters[$],

24 shiftedS2[$], s2 z e r o [ $ ] , s 2 l o o p c o u n t [ $ ] ,

25 s2 [ $ ] , D arr [ $ ] , I a r r [ $ ] , C arr [ $ ] ,

26 s one[$], mDex[$] with align[$];

27

28

29 / ∗ ∗ ∗ ∗ ∗ Assumes that S1[$] >= t e m p S 2 [ $ ] ∗ ∗ ∗ ∗ ∗ / 107 108

30 read id[$],s1[$],temps2[$],parameters[$] in align[$];

31

32 setscope align[$]

33 msg ‘‘The file input was: ’’;

34 print id[$], s1[$], temps2[$] in align[$];

35

36 PERFORM = 1 ; / ∗ Performance monitor on ∗ /

37

38 / ∗

39 PARAMETERS[0]: minimized memory use=0 vs.

40 mimimized PE # = 1

41 PARAMETERS [1]: m

42 PARAMETERS [2]: n

43 PARAMETERS[3]: Gap Insert

44 PARAMETERS [4]: Gap Extend

45 PARAMETERS [5]: Match

46 PARAMETERS [6]: Miss (match) ∗ /

47

48 / ∗ Extract parameters from parallel

49 input into scalars ∗ /

50 CALL extract parameters;

51

52 / ∗ S1 should be the longer of the two strings ,

53 if not switch ∗ /

54 CALL c h e c k s i z e ;

55

56 / ∗ Initialize the ASC arrays D a r r , I a r r a n d C a r r

57 to and 0 respectively and calculate

58 m and n iteratively ∗ /

59 / ∗ Shift all of S2 into every PE, right shifted for

60 each successive PE as

61 outlined in the CASB06 paper ∗ /

62 CALL i n i t a r r a y s ;

63

64 / ∗ Tilt the input matrix values ∗ /

65 / ∗ ∗ ∗ ∗ Removed and handle in init a r r a y s

66 ∗ CALL s w a m p t i l t S 2 ;

67 ∗ Saves a lot of compute time ∗ ∗ ∗ /

68

69 msg ‘‘After setup loop m,n: ’’, m,n;

70 109

71 / ∗ Calculate the matrix values ∗ /

72 CALL c a s b m a t r i x c a l c ;

73

74 PERFORM = 0 ; / ∗ Performance monitor off ∗ /

75

76 / ∗ The ‘‘original ’’ column value is

77 ( m a x c o l i d − I D [ m D e x ] ) ∗ /

78 msg ‘‘The max value is:’’, max val, ‘‘from PE’’,

79 max id, ‘‘in column’’, max col id ,

80 ‘‘or in column’’,max col id − max id ;

81

82 msg ‘‘Monitoring scalar , parallel’’,

83 sc perform , pa perform ;

84

85 / ∗ p r i n t S 2 , D a r r , I a r r , a n d C a r r

86 (optional output call) ∗ /

87 CALL p r i n t a r r a y c o l s ;

88

89 / ∗ print ID, S1 and ASC Arrays:

90 (optional output call) ∗ /

91 CALL p ri nt PE v al s ;

92 endsetscope;

93 end ;

¦ Listing A.2: SWAMP ASC: Local Variables (swamp vars.h) ¥

1 §/ ∗ s w a m p v a r s . h ∗ / ¤ 2 d e f i n e (MAX ARRAY SIZE, 1 9 2 ) ;

3

4 / ∗ Setup variables for reading in 2 strings to be aligned ∗ /

5 char parallel s1[$];

6 char parallel temps2[$];

7 char parallel shiftedS2[$];

8 char parallel s2[$,MAX ARRAY SIZE ] ;

9

10 int parallel id[$];

11 int p a r a l l e l D arr [ $ ,MAX ARRAY SIZE ] ;

12 int p a r a l l e l I a r r [ $ ,MAX ARRAY SIZE ] ;

13 int p a r a l l e l C arr [ $ ,MAX ARRAY SIZE ] ;

14

15 index parallel s2 z e r o [ $ ] ;

16 index parallel s2 l o o p c o u n t [ $ ] ; 110

17 index parallel s one [ $ ] ;

18 index parallel mDEX[$];

19

20 / ∗ needed for traceback information ∗ /

21 int s c a l a r max val ;

22 int s c a l a r max col ;

23 int s c a l a r max id ;

24 int s c a l a r max col id ;

25

26 / ∗ Parameter input and scalar values ∗ /

27 int s c a l a r loop count ;

28 int s c a l a r s2 count ;

29 int scalar i, j, m, n;

30 int scalar params[7];

31 int scalar MINIMIZE PEs ;

32 int s c a l a r GAP INSERT;

33 int s c a l a r GAP EXTEND;

34 int s c a l a r MATCH;

35 int s c a l a r MISMATCH;

36

37 int p a r a l l e l PARAMETERS[ $ ] ;

38

39 / ∗ For grouping in an association and masking ∗ /

40 logical parallel align[$];

¦Listing A.3: SWAMP ASC: Extracting Parameter from File (extract parameters.asc) ¥

1 §/ ∗ e x t r a c t parameters . asc ∗ / ¤ 2 / ∗ Convert parallel input values into scalar variables ∗ /

3 / ∗ Shannon I. Steinfadt ∗ /

4 / ∗ January 14, 2008 ∗ /

5

6 subroutine extract p ar a m et e r s

7 #include swamp vars . h

8

9 / ∗ ∗ ∗ ∗ ∗ ∗ ∗ Set up the scalar variables here ∗ ∗ ∗ ∗ ∗ /

10 / ∗ Convert the parallel int variable

11 PARAMETERS to a scalar ∗ /

12 / ∗ Read in min PEs/mem use , n, m, MATCH,

13 MISMATCH,GAP INSERT,GAP EXTEND

14 into params array (m = | S 1 | , n = | S 2 | ) ∗ /

15 MSG ‘‘Converting Scalars: minimze PEs, m, n, MATCH, 111

16 MISMATCH, GAP INSERT, GAP EXTEND. ’ ’ ;

17 i = 0 ;

18 FOR mDex in PARAMETERS[ $ ] .GE. 0

19 IF (I .LT. 7) THEN

20 params [ i ] = PARAMETERS[mDex ] ;

21 i = i + 1 ;

22 ENDIF;

23 ENDFOR mDex;

24 / ∗

25 minimized memory use=0 vs. mimimized PE # = 1

26 PARAMETERS [ 0 ] :

27 PARAMETERS [1]: m

28 PARAMETERS [2]: n

29 PARAMETERS[3]: Gap Insert

30 PARAMETERS [4]: Gap Extend

31 PARAMETERS [5]: Match

32 PARAMETERS [6]: Miss (match) ∗ /

33

34 / ∗ Set n, m, MATCH, MISMATCH,

35 GAP INSERT,GAP EXTEND ∗ /

36 MINIMIZE PEs = params[0];

37 m = params[1];

38 n = params[2];

39 GAP INSERT = params[3];

40 GAP EXTEND = params [4];

41 MATCH = params [5]; / ∗ 2 vals used for DNA align , ∗ /

42 MISMATCH = params [ 6 ] ; / ∗ No Amino Acids yet ∗ /

43 MSG ‘‘Scalar variables: ’’ MINIMIZE PEs , m, n ,

44 GAP INSERT, GAP EXTEND,

45 MATCH, MISMATCH;

46 end ;

¦ Listing A.4: SWAMP ASC: String Size Check (check size.asc) ¥

1 §/ ∗ c h e c k s i z e . a s c ∗ / ¤ 2 / ∗ Check the size of m and n and if is m < n , s w a p t h e m ∗ /

3 / ∗ Shannon I. Steinfadt ∗ /

4 / ∗ December 2, 2007 ∗ /

5

6 subroutine check s i z e

7 #include swamp vars . h

8 112

9 / ∗ ‘‘Calculate ’’ m using MAX function ∗ /

10 i f S1[$] .ne. ‘‘− ’ ’ then

11 m= maxval(ID[$]) + 1;

12 e n d i f ;

13

14 / ∗ ‘‘Calculate ’’ for n through the MAX function ∗ /

15 i f tempS2[$] .ne. ‘‘− ’ ’ then

16 n = maxval(ID[$]) + 1;

17 e n d i f ;

18

19 / ∗ ∗ ∗ ∗ ∗ ∗

20 If minimizing PE’s −−> want to minimize the number of

21 total PEs by using more memory per individual PE.

22

23 To check this , check first that the scalar variable

24 MINIMIZE PEs is true (set to 1). When that’s true,

25 m should be the smaller of the two values , since m

26 determines how many PEs are used.

27

28 if (minimizing PEs) .and. (m > n )

29 ∗ ∗ ∗ ∗ ∗ ∗

30 If minimizing memory use per PE −−> y o u n e e d t o

31 minimize the number of cells being used. This

32 is a little false in that the max number of array

33 cells is set to the default in MAX ARRAY SIZE

34 in the CASB variables ‘‘.h’’ file. It will cut down

35 on parallel operations since the loops that loop

36 through the 2 − D ASC arrays are controlled by n

37

38 When this is true , n’s value should be the smaller

39 o f t h e t w o

40

41 if (minimizing mem use) .and. (m < n )

42 ∗ ∗ ∗ ∗ ∗ ∗ /

43 i f ((MINIMIZE PEs .eq. 1) .and. (m > n ) ) . or .

44 ((MINIMIZE PEs .eq. 0) .and. (m < n ) ) then

45 / ∗ Swap using shiftedS2 as a temp location ,

46 shiftedS2 is reset in casb v e r t S 2 s h i f t ∗ /

47

48 / ∗ Copy 2nd input string into our temp location ∗ /

49 shiftedS2[$] = tempS2[$]; 113

50

51 / ∗ Re − assign S1 into S2’s previoius location ∗ /

52 tempS2[$] = S1[$];

53

54 / ∗ Move 2nd larger input string ∗ /

55 S1[$] = shiftedS2[$];

56

57 / ∗ temp location is loop c o u n t ∗ /

58 / ∗ Reassign the values of m and n ∗ /

59 loop count = m;

60 m = n ;

61 n = loop count ;

62

63 e n d i f ;

64 end ;

¦ Listing A.5: SWAMP ASC: Initialize Arrays (initialize arrays.asc) ¥

1 §/ ∗ i n i t a r r a y s . a s c ∗ / ¤ 2 / ∗ Shannon I. Steinfadt ∗ /

3 / ∗ Created on December 1, 2007 ∗ /

4

5 / ∗ This file will distribute all of s2 to each PE, but

6 r i g h t − shifted one for each successive PE as done

7 in the CASB 2006 paper ∗ /

8

9 / ∗ ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗

10 Step 1: treat shiftedS2[$] as a stack that gets one

11 extra letter pushed on top of it each time through the

12 loop. The ID[$] is necessary to iterate through the

13 characters easily .

14

15 If there are no more characters left , use a placeholder

16 ‘ ‘ / ’ ’ v a l u e

17

18 S t e p 2 :

19 Copy that entire ‘‘stack ’’ into the corresponding column

20 i n t h e 2 − D ASC array of S2[$, loop c o u n t ]

21

22 ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗

23 I n p u t :

24 ID ,S1 ,TEMPS2 , 114

25 0 @ @

26 1 C C

27 2 A T

28 3 T T

29 4 T G

30 5 G −

31 ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗

32 Result of Shift:

33

34 The output will be the following for

35 S1 ,TEMPS2 ,SHIFTEDS2 , S2

36 0@@@CTTG/////

37 1CC/@CTTG////

38 2AT//@CTTG///

39 3TT///@CTTG//

40 4TG////@CTTG/

41 5 G − /////@CTTG

42 ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ ∗ /

43

44 subroutine casb v e r t S 2 s h i f t

45 #include swamp vars . h

46

47 / ∗ for each col in the ‘‘matrix’’ that is m+n wide ∗ /

48 FIRST

49 loop count = 0 ;

50 / ∗ default no value ∗ /

51 shiftedS2[$] = ‘‘/’’;

52 / ∗ set up the index of where to add

53 each new character ∗ /

54 s 2 zero[$] = ID[$] .eq. 0;

55

56 LOOP

57 i f ID[$] .gt. 0 then

58 / ∗ Move the string down 1 element ∗ /

59 shiftedS2[$] = shiftedS2[$ −1];

60 e n d i f ;

61

62 / ∗ Set up the mask to look at the next character ∗ /

63 / ∗ avoid mask error and copying ‘‘ − ’’ ∗ /

64 / ∗ If outside of S1 or S2 ∗ /

65 i f ( loop count .ge. m) .or. 115

66 ( loop count .ge. n) then

67 shiftedS2[s2 zero] = ‘‘/’’; / ∗ placeholder ∗ /

68 else

69 / ∗ ‘‘Push’’ next letter on top of shiftedS2 ∗ /

70 s 2 l o o p count[$] = ID[$] .eq. loop count ;

71 shiftedS2[s2 zero] = temps2[s2 l o o p c o u n t ] ;

72 e n d i f ;

73

74 / ∗ Copy the values in shiftedS2 into the array ∗ /

75 S2 [ $ , loop count] = shiftedS2[$];

76

77 / ∗ Init arrays to all zeros ∗ /

78 D arr [ $ , loop count ] = 0 ;

79 I a r r [ $ , loop count ] = 0 ;

80 C arr [ $ , loop count ] = 0 ;

81

82 / ∗ print shiftedS2[$] in align[$]; ∗ /

83 loop count = loop count + 1 ;

84 UNTIL loop count .eq. m+n−1

85 ENDLOOP;

86 end ;

¦ Listing A.6: SWAMP ASC: Matrix Computation (casb matrix calc.asc) ¥

1 §/ ∗ c a s b m a t r i x c a l c . a s c ∗ / ¤ 2 / ∗ handle the actual computation of the staggered matrix ∗ /

3 / ∗ Shannon I. Steinfadt ∗ /

4 / ∗ November 25, 2007 ∗ /

5

6 subroutine casb m a t r i x c a l c

7 #include swamp vars . h

8

9 / ∗ for each column in the array , calc. values ∗ /

10 f i r s t

11 / ∗ start at 2, since element zero will be unchanged

12 and the first PE (PE0) remains default values ∗ /

13 loop count = 2 ;

14 loop

15 / ∗ ∗∗∗∗∗∗∗∗∗∗∗ WESTERNNEIGHBOR ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ /

16 / ∗ Calculate the Western Neighbor (Deletion) ∗ /

17 / ∗ handle inter − PE lookup for W neighbor (D) ∗ /

18 D arr [ $ , loop count ] = D arr [ $ , loop count −1]; 116

19

20 / ∗ find ’max’ of two values ∗ /

21 i f (D arr [ $ , loop count ] . l t .

22 (C arr [ $ , loop count −1] − GAP INSERT) ) then

23 D arr [ $ , loop count ] =

24 C arr [ $ , loop count −1] − GAP INSERT;

25 e n d i f ;

26 / ∗ subtract off the gap extension penalty ∗ /

27 D arr [ $ , loop count ] =

28 D arr [ $ , loop count ] − GAP EXTEND;

29

30 / ∗ ∗∗∗∗∗∗∗∗∗∗∗ NORTHERNNEIGHBOR ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ /

31 / ∗ Calculate the Northern Neighbor (Insertion) ∗ /

32 I a r r [ $ , loop count ] = I a r r [ $−1, loop count −1];

33 / ∗ find ’max of the two values ∗ /

34 i f (I a r r [ $ , loop count ] . l t .

35 (C arr [ $−1, loop count −1] − GAP INSERT) ) then

36 I a r r [ $ , loop count ] =

37 C arr [ $−1, loop count −1] − GAP INSERT;

38 e n d i f ;

39

40 / ∗ subtrace off the gap exentension penalty ∗ /

41 I a r r [ $ , loop count ] =

42 I a r r [ $ , loop count ] − GAP EXTEND;

43

44 / ∗ ∗∗∗∗∗∗∗∗∗∗∗ NORTHWESTNEIGHBOR ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ /

45 / ∗ Calculate the NW Neighbor (Continuation) ∗ /

46

47 / ∗ don’t include PE0 where the default

48 values don’t change ∗ /

49 / ∗ Avoids a segmentation fault by referencing

50 a n o n − existant location ∗ /

51 i f (S1[$] .ne. ‘‘@’’) then

52 C arr [ $ , loop count ] =

53 C arr [ $−1, loop count −2];

54 e n d i f ;

55

56 / ∗ Compare characters for match / mismatch ∗ /

57 i f (S1[$] .eq. S2[$, loop count]) then

58 C arr [ $ , loop count ] =

59 C arr [ $ , loop count ] + MATCH; 117

60 else

61 C arr [ $ , loop count ] =

62 C arr [ $ , loop count ] − MISMATCH;

63 e n d i f ;

64

65 / ∗ Find max value from Current C, D, I and 0 ∗ /

66 i f (C arr [ $ , loop count] .lt. 0) then

67 C arr [ $ , loop count ] = 0 ;

68

69 i f (C arr [ $ , loop count ] . l t .

70 D arr [ $ , loop count]) then

71 C arr [ $ , loop count ] = D arr [ $ , loop count ] ;

72 e n d i f ;

73

74 i f (C arr [ $ , loop count ] . l t .

75 I a r r [ $ , loop count]) then

76 C arr [ $ , loop count ] = I a r r [ $ , loop count ] ;

77 e n d i f ;

78

79 e n d i f ;

80

81 / ∗ ∗∗∗∗∗∗∗∗∗∗∗ MAXSOFARCALCULATIONS ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ /

82 max col = maxval(C arr [ $ , loop count ] ) ;

83 i f ( max val . l t . max col ) then

84 / ∗ store it as the largest so far ∗ /

85 max val = max col ;

86 / ∗ get the PE index ∗ /

87 mDex[$] = maxdex(C arr [ $ , loop count ] ) ;

88 max id = ID[mDex];

89 max col id = loop count ;

90 e n d i f ;

91

92 loop count = loop count + 1 ;

93 u n t i l ( loop count .eq. m+n−1)

94 endloop ;

95 end ;

¦ Listing A.7: SWAMP ASC: Print Columns of Matrix (print array cols.asc) ¥

1 §/ ∗ p r i n t a r r a y c o l s . a s c ∗ / ¤ 2 / ∗ Shannon I. Steinfadt ∗ /

3 / ∗ November 25, 2007 ∗ / 118

4

5 SUBROUTINE p r i n t a r r a y c o l s

6 #INCLUDE swamp vars . h

7

8 msg ‘‘Printing out s2, D arr , I a r r ,

9 and C arr arrays’’;

10 f i r s t

11 loop count = 0 ;

12 loop

13 msg ‘‘Array col: ’’, loop count ;

14 print S2[$, loop count ] ,

15 D arr [ $ , loop count ] ,

16 I a r r [ $ , loop count ] ,

17 C arr [ $ , loop count] in align[$];

18 loop count = loop count + 1 ;

19 u n t i l ( loop count .eq. m+n−1) . or .

20 ( loop count .eq. MAX ARRAY SIZE)

21 endloop ;

22 end ;

¦ Listing A.8: SWAMP ASC: Printing PE values (print PE vals.asc) ¥

1 §/ ∗ p r i n t PE v a l s ∗ / ¤ 2 / ∗ print ID, S1 and ASC Arrays: S2, D a r r , I a r r , a n d C a r r ∗ /

3 / ∗ Shannon I. Steinfadt ∗ /

4 / ∗ November 25, 2007 ∗ /

5

6 subroutine print PE v al s

7 #include swamp vars . h

8

9 msg ‘‘Printing out PEs, row−wise ’ ’ ;

10

11 FOR s one in S1[$] .ne. ‘‘− ’’

12 / ∗ print ID[$], S1[$] in align[$]; ∗ /

13 msg ‘‘PE: ’’,ID[s one ] , S1 [ s one ] ;

14 f i r s t

15 s2 count = 0 ;

16 loop

17 / ∗ p r i n t S 2 , D a r r , I a r r , a n d C a r r ∗ /

18 msg S2 [ s one , s2 count ] ,

19 D arr [ s one , s2 count ] ,

20 I a r r [ s one , s2 count ] , 119

21 C arr [ s one , s2 count ] ;

22 s2 count = s2 count + 1 ;

23 u n t i l ( s2 count .eq. m+ n − 1)

24 endloop ;

25

26 ENDFOR s one ;

27 end ;

¦ ¥ APPENDIX B

ClearSpeed Code for SWAMP+

This contains the code listings for the ClearSpeed CSX 620 hardware in the Cn language.

Listing B.1: ClearSpeed Code for All Versions of SWAMP+ (swamp.h)

// swamp.h
//
// Header file for the Clearspeed implementation of SWAMP
//
// Shannon Steinfadt & Kevin Schaffer
//
// Sept. 1, 2009

#if !defined(SWAMP_H)
#define SWAMP_H

typedef struct SwampParameters
{
    int Miss;
    int Match;
    int GapInsert;
    int GapExtend;
    int NumAlignments;
    float DegradeFactor;
    char SwampPlusFlag;
} SwampParameters;

typedef struct SwampResults
{
    int MaxScore;
    int QueryIndex;
    int DatabaseIndex;
} SwampResults;

#endif


Listing B.2: Cn Code for SWAMP+ Multiple-to-Multiple Local Alignment (swampm2m.c)

1 §// swampm2m. cn ¤ 2 // A full implementation of SWAMP+ for m2m.

3 // To run as SWAMP only (two alignments returned)

4 // set the command line arguement for the

5 // number of alignments equal to one

6

7 // Shannon Steinfadt & Kevin Schaffer (Kevin − A P I o n l y )

8

9 #include

10 #include // for calloc

11 #include

12 #include

13 #include

14 #include / / f o r c s m a x p

15 #include ‘ ‘swamp . h ’ ’

16 #include ‘ ‘ asc . h ’ ’

17

18

19 // Globals used for communication with host.

20 // If you change the names of these

21 // variables , you must also change the

22 // #defines in the host program.

23 SwampParameters ∗ parameters;

24 char ∗ querySeq ;

25 char ∗dbSeq ;

26 SwampResults ∗ r e s u l t s ;

27

28 #define MAX STRING LEN 150

29 // Used for debugging purposes

30 //#define outputArrays

31 //#define showInit

32 #define s t a t s

33

34 int main ( int argc , char ∗ argv [ ] )

35 {

36 mono int alignIter , k;

37 // Len of s1 and s2

38 mono int m queryLen ;

39 mono int n dbLen ; 122

40 mono int pe , diag ;

41 mono int diagMax ;

42 mono int maxSoFar ;

43 mono int maxPE;

44 mono int maxDiagIndex ;

45 mono char prev ;

46 mono unsigned int s t a r t c y c l e s ;

47 mono unsigned int mid cycles ;

48 mono unsigned int c a l c c y c l e s ;

49 mono unsigned int e n d c y c l e s ;

50

51 // No dynamically sized parallel arrays possible

52 // Quite a limitation

53 mono char alignS1 [MAX STRING LEN ∗ 2 ] ;

54 mono char alignS2 [MAX STRING LEN ∗ 2 ] ;

55 mono char tempstr [MAX STRING LEN ∗ 2 ] ;

56 mono char ∗ ptr aS1 , ∗ ptr aS2 ;

57 poly short d n [MAX STRING LEN ∗ 2 ] ;

58 poly short i w [MAX STRING LEN ∗ 2 ] ;

59 poly short c nw [MAX STRING LEN ∗ 2 ] ;

60 poly short tmp [MAX STRING LEN ∗ 2 ] ;

61

62 // Output parameters and sequences

63 // single char from query seq. in each PE

64 poly char ps1 ;

65

66 poly char ps2 [MAX STRING LEN ∗ 2 ] ;

67

68 poly char t r a c e b a c k d i r [MAX STRING LEN ∗ 2 ] ;

69 poly bool maxValBool;

70

71 poly short penum ;

72

73 poly char ∗ poly dst ; // for distributing s2

74 mono char ∗ mono s r c ; // mono pointer

75

76 // Output parameters and sequences

77 printf(‘‘CSX: Miss: %d\n’’, parameters−>Miss ) ;

78 printf(‘‘CSX: Match: %d\n’’, parameters−>Match ) ;

79 printf(‘‘CSX: GapInsert: %d\n’’, parameters−>GapInsert );

80 printf(‘‘CSX: GapExtend: %d\n’’, parameters−>GapExtend ); 123

81 printf ( ‘ ‘CSX: NumAlignments:

82 %d\n’’, parameters−>NumAlignments );

83 printf (‘‘CSX: SwampPlusFlag:

84 %c\n’’, parameters−>SwampPlusFlag );

85 printf(‘‘CSX: Query: %s \n’’, querySeq);

86 printf(‘‘CSX: Database: %s \n’’, dbSeq);

87

88 // Set up the string lengths

89 m queryLen = strlen(querySeq);

90 n dbLen = strlen(dbSeq);

91

92 printf ( ‘ ‘CSX: m=%d\n ’ ’ , m queryLen ) ;

93 printf (‘‘CSX: n=%d\n ’ ’ , n dbLen ) ;

94

95 // This is the offset used often

96 penum = get penum ( ) ;

97

98

99 // Init edges once, not writing into , but used

100 // for traceback

101 // May be able to use memcpym2p or memsetp

102 for (diag = 0; diag < m queryLen ; diag++)

103 {

104 d n [ diag ] = 0 ;

105 i w [ diag ] = 0 ;

106 c nw[diag]= 0;

107 t r a c e b a c k dir[diag] = ’X’;

108 }

109

110 i f (penum

111 memcpym2p(&ps1 , querySeq + penum, sizeof ( char ));

112 #i f d e f o u t p u t I n i t

113 printfp ( ‘ ‘ps1[%02d]=%c\n’’, penum, ps1);

114 #endif

115 }

116 // Added loop here for s2m and m2m

117 for (alignIter = 0; alignIter < parameters−>NumAlignments ;

118 alignIter++)

119 {

120 s t a r t c y c l e s = g e t c y c l e s ( ) ;

121 124

122 // Set up the variables used for the traceback

123 maxSoFar = 0;

124 maxPE = −1;

125 maxDiagIndex = −1;

126 r e s u l t s −>QueryIndex = −1;

127 r e s u l t s −>DatabaseIndex = −1;

128

129 #ifdef showInit

130 // Initilization for ALL PEs regardless of string size

131 printf(‘‘Starting init \n ’ ’ ) ;

132 #e n d i f

133

134 //Init poly strings to default char − m a x i m u m

135 // number of chars is

136 memsetp(ps2, ’− ’ , m queryLen + n dbLen ) ;

137

138 i f (penum < m queryLen ) {

139 // Init ps1 to hold it’s part of s1

140 //‘‘scatter ’’ chars into PEs

141

142 // This is copying the array and will

143 / / w o r k ‘ ‘ i n − situ ’’ w/out the shift

144 // that’s done in the ASC version

145 s r c = dbSeq ;

146 dst = ps2 + penum;

147

148 // Copy the entire array , shifting 1 value at a time

149 while (∗ s r c != ’ \0 ’ )

150 ∗ dst++ = ∗ s r c ++;

151

152 dst = ps2 + m queryLen + n dbLen − 1 ;

153 ∗ dst = ’ \0 ’ ; // Null terminate destination strings

154

155 #ifdef showInit

156 printfp(‘‘PE%02d: %s \n’’, penum, ps2);

157 #e n d i f

158

159 //////// Computations for the arrays /////////

160

161 // Start calculations

162 printf(‘‘Start calc of matrix ’’); 125

163

164 // The second column doesn’t need to

165 // be calculated , comparing ‘‘@’’

166 for ( diag = 2 ;

167 diag < m queryLen + n dbLen −1;

168 diag++)

169 {

170 #ifdef stats

171 mid cycles = g e t c y c l e s ( ) ;

172 #e n d i f

173

174 // ∗ ∗ Must swazzle before narrowing

175 // the active PEs or the bottom row

176 //won’t be set correctly ,

177 // nor will the last nw diag − 2

178

179 // Swazzle for NW diag values

180 c nw[diag] = swazzle up ( c nw [ diag −2]);

181

182 // Compute the Northern neighbor

183 // Swazzle the c n w [ d i a g − 1 ] & d n [ d i a g − 1 ]

184 // Swazzle to get the NW value of C

185 tmp[diag] = swazzle up ( c nw [ diag −1]) −

186 parameters−>GapInsert ;

187 d n [ diag ] =

188 cs maxp(tmp[diag] ,swazzle up ( d n [ diag − 1 ] ) ) ;

189 d n [ diag ] = d n [ diag ] − parameters−>GapExtend ;

190

191 i f (ps2[diag] != ’− ’) {

192 // Compute the Western neighbor

193 // No swazzle here ,

194 // just look at diag − 1 i n s a m e r o w

195 tmp [ diag ] =

196 c nw [ diag −1] − parameters−>GapInsert ;

197 i w [ diag ] =

198 cs maxp(tmp[diag], i w [ diag −1]);

199 i w [ diag ] = i w [ diag ] − parameters−>GapExtend ;

200

201 i f (ps2[diag] == ps1) {

202 c nw [ diag ] =

203 c nw[ diag]+parameters−>Match ; 126

204 }

205 else {

206 c nw [ diag ] = c nw [ diag ] − parameters−>Miss ;

207 }

208

209 // Max over zero for NW

210 i f ( c nw [ diag ] < 0) {

211 c nw[diag] = 0;

212 t r a c e b a c k dir[diag] = ’X’;

213 }

214 else {

215 t r a c e b a c k dir[diag] = ’C’;

216 }

217

218 i f ( d n [ diag ] > c nw [ diag ] ) {

219 t r a c e b a c k dir[diag] = ’N’;

220 }

221 c nw[diag] = cs maxp ( c nw [ diag ] , d n [ diag ] ) ;

222

223 i f ( i w [ diag ] > c nw [ diag ] ) {

224 t r a c e b a c k dir[diag] = ’W’;

225 }

226 c nw[diag] = cs maxp ( c nw [ diag ] , i w [ diag ] ) ;

227

228 // Find the max of the diag (here a column)

229 diagMax = max int ( c nw [ diag ] ) ;

230 i f ( diagMax > maxSoFar ) {

231 maxSoFar = diagMax;

232 maxValBool = select m a x i n t ( c nw [ diag ] ) ;

233 //double check − can only select one

234 i f (count(maxValBool == 1)) {

235 i f (maxValBool == true) {

236 maxPE = g e t short(penum);

237 maxDiagIndex = diag;

238 r e s u l t s −>QueryIndex = maxPE;

239 r e s u l t s −>DatabaseIndex = diag−maxPE;

240 }

241 }

242 }

243 } // End if (ps2[diag] != ’ − ’)

244 printf(‘‘. ’’); 127

245 }

246 #ifdef stats

247 c a l c c y c l e s= g e t c y c l e s ( ) ;

248 #e n d i f

249

250 p r i n t f ( ‘ ‘ \ n ’ ’ ) ;

251

252 #ifdef outputArray

253 // print out the c n w a r r a y

254 p r i n t f ( ‘ ‘ \ nNorthWest Array\n ’ ’ ) ;

255 for ( pe = 0 ; pe < m queryLen; pe++) {

256 i f (penum == pe) {

257 for ( diag = 0 ;

258 diag < m queryLen + n dbLen − 1 ;

259 diag++)

260 i f (ps2[diag] != ’− ’)

261 printfp(‘‘%02d ’’, c nw [ diag ] ) ;

262 p r i n t f ( ‘ ‘ \ n ’ ’ ) ;

263 }

264 }

265

266 p r i n t f ( ‘ ‘ \ nNorth Array\n ’ ’ ) ;

267 for ( pe = 0 ; pe < m queryLen; pe++) {

268 i f (penum == pe) {

269 for ( diag = 0 ;

270 diag < m queryLen + n dbLen − 1 ;

271 diag++)

272 i f (ps2[diag] != ’− ’)

273 printfp(‘‘%02d ’’, d n [ diag ] ) ;

274 p r i n t f ( ‘ ‘ \ n ’ ’ ) ;

275 }

276 }

277

278 p r i n t f ( ‘ ‘ \ nWest Array\n ’ ’ ) ;

279 for ( pe = 0 ; pe < m queryLen; pe++) {

280 i f (penum == pe) {

281 for ( diag = 0 ;

282 diag < m queryLen + n dbLen − 1 ;

283 diag++)

284 i f (ps2[diag] != ’− ’)

285 printfp(‘‘%02d ’’, i w [ diag ] ) ; 128

286 p r i n t f ( ‘ ‘ \ n ’ ’ ) ;

287 }

288 }

289

290 p r i n t f ( ‘ ‘ \ nTraceback Array\n ’ ’ ) ;

291 for ( pe = 0 ; pe < m queryLen; pe++) {

292 i f (penum == pe) {

293 for ( diag = 0 ;

294 diag < m queryLen + n dbLen − 1 ;

295 diag++)

296 i f (ps2[diag] != ’− ’)

297 printfp(‘‘%c ’’, traceback dir[diag]);

298 p r i n t f ( ‘ ‘ \ n ’ ’ ) ;

299 }

300 }

301 p r i n t f ( ‘ ‘ \ n ’ ’ ) ;

302 #e n d i f // oututArray

303

304 / ∗ ∗ ∗ ∗ ∗ Traceback for SWAMP ∗ ∗ ∗ ∗ ∗ /

305 ptr aS1 = alignS1;

306 ptr aS2 = alignS2;

307

308 alignS1[0] = ’ \0 ’ ;

309 alignS2[0] = ’ \0 ’ ;

310

311 / / g e t c h a r − can only have one active PE

312 // therefore you need an ’if’ mask

313 i f (penum == maxPE)

314 prev = g e t char(traceback dir [maxDiagIndex ]);

315 printf(‘‘Traceback max: %d at PE=%d,

316 Col=%d , Diag=%d\n ’ ’ ,

317 maxSoFar ,

318 maxPE,

319 r e s u l t s −>DatabaseIndex ,

320 maxDiagIndex );

321 // Need to use ASC − like functions

322 while (prev != ’X’)

323 {

324 #ifdef outputArrays

325 printf(‘‘%2d %2d: %c\n ’ ’ ,

326 maxPE, maxDiagIndex − maxPE, prev ) ; 129

327 #e n d i f

328 i f (penum == maxPE) {

329 i f (prev == ’C’) // corner NW continue {

330 tempstr[0] = get c h a r ( ps1 ) ;

331 tempstr[1] = ’ \0 ’ ;

332 strcat(tempstr, alignS1);

333 strcpy(alignS1 , tempstr);

334

335 tempstr[0] = get char(ps2[maxDiagIndex ]);

336 tempstr[1] = ’ \0 ’ ;

337 strcat(tempstr, alignS2);

338 strcpy(alignS2 , tempstr);

339

340 / / f o r m2m

341 ps1 = ’Z ’ ;

342 //ps2[maxDiagIndex] = ’O’;

343 dbSeq[maxDiagIndex−maxPE] = ’O’ ;

344

345 maxDiagIndex = maxDiagIndex − 2 ;

346 maxPE = maxPE − 1 ;

347 }

348 else i f (prev == ’N’) {

349 tempstr[0] = get c h a r ( ps1 ) ;

350 tempstr[1] = ’ \0 ’ ;

351 strcat(tempstr, alignS1);

352 strcpy(alignS1 , tempstr);

353

354 tempstr[0] = ’− ’;

355 tempstr[1] = ’ \0 ’ ;

356 strcat(tempstr, alignS2);

357 strcpy(alignS2 , tempstr);

358

359 maxDiagIndex = maxDiagIndex − 1 ;

360 maxPE = maxPE − 1 ;

361 }

362 else i f (prev == ’W’) {

363 tempstr[0] = ’− ’;

364 tempstr[1] = ’ \0 ’ ;

365 strcat(tempstr, alignS1);

366 strcpy(alignS1 , tempstr);

367 130

368 tempstr[0] = get char(ps2[maxDiagIndex ]);

369 tempstr[1] = ’ \0 ’ ;

370 strcat(tempstr, alignS2);

371 strcpy(alignS2 , tempstr);

372

373 maxDiagIndex = maxDiagIndex −1;

374 }

375 else

376 break ; // It’s an ’X’ or an error

377 } // End if(penum == maxPE) from above

378

379 // maxPE has changed , need a new ‘‘if ’’ statement

380 i f (penum == maxPE)

381 prev = g e t char(traceback dir [maxDiagIndex ]);

382 }

383 #ifdef stats

384 e n d c y c l e s= g e t c y c l e s ( ) ;

385 printf(‘‘total: %d\ t c a l c : %d\ ttraceback: %d\n ’ ’ ,

386 e n d c y c l e s − s t a r t c y c l e s ,

387 c a l c c y c l e s −mid cycles ,

388 end cycles −mid cycles ) ;

389 #e n d i f

390 printf(‘‘alignS2 = %s \n’’,alignS2);

391 printf(‘‘alignS1 = %s \n’’,alignS1);

392 // Fill in results

393 i f ( maxSoFar > r e s u l t s −>MaxScore )

394 r e s u l t s −>MaxScore = maxSoFar;

395 } // end if(penum < m q u e r y L e n )

396 } // for(alignIter < p a r a m e t e r s −> NumAlignments )

397

398 p r i n t f ( ‘ ‘ \ n\nEnd of Cn program. \ n\n ’ ’ ) ;

399 return 0 ;

400 }

¦ Listing B.3: ClearSpeed Cn Code for Associative Functions (asc.h) ¥

/*
 * ASC Library 2.0
 *
 * Author: Kevin Schaffer
 * Last updated: June 11, 2009
 */

#if !defined(ASC_H)
#define ASC_H

/** Type to represent Boolean values. */
typedef enum bool
{
    false,
    true
} bool;

/** Returns the number of nonzero components in a poly bool. */
short count(poly bool condition);

/** Converts a poly char into a mono char.  Exactly one PE must be
 *  active when calling this function; otherwise the return value
 *  is undefined. */
char get_char(poly char value);

/** Converts a poly short into a mono short.  Exactly one PE must be
 *  active when calling this function; otherwise the return value
 *  is undefined. */
short get_short(poly short value);

/** Converts a poly int into a mono int.  Exactly one PE must be
 *  active when calling this function; otherwise the return value
 *  is undefined. */
int get_int(poly int value);

/** Converts a poly long into a mono long.  Exactly one PE must be
 *  active when calling this function; otherwise the return value
 *  is undefined. */
long get_long(poly long value);

/** Converts a poly unsigned char into a mono unsigned char.  Exactly
 *  one PE must be active when calling this function; otherwise the
 *  return value is undefined. */
unsigned char get_unsigned_char(poly unsigned char value);

/** Converts a poly unsigned short into a mono unsigned short.
 *  Exactly one PE must be active when calling this function;
 *  otherwise the return value is undefined. */
unsigned short get_unsigned_short(poly unsigned short value);

/** Converts a poly unsigned int into a mono unsigned int.  Exactly
 *  one PE must be active when calling this function; otherwise the
 *  return value is undefined. */
unsigned int get_unsigned_int(poly unsigned int value);

/** Converts a poly unsigned long into a mono unsigned long.  Exactly
 *  one PE must be active when calling this function; otherwise the
 *  return value is undefined. */
unsigned long get_unsigned_long(poly unsigned long value);

/** Converts a poly float into a mono float.  Exactly one PE must be
 *  active when calling this function; otherwise the return value
 *  is undefined. */
float get_float(poly float value);

/** Converts a poly double into a mono double.  Exactly one PE must be
 *  active when calling this function; otherwise the return value
 *  is undefined. */
double get_double(poly double value);

/** Copies a poly string into a mono buffer.  Exactly one PE must be
 *  active when calling this function; otherwise the results are
 *  undefined.  Returns the length of the string copied into the
 *  buffer. */
size_t get_string(char *buffer, size_t buffer_len,
                  poly const char *value);

/** Returns the largest component of a poly char.  If there are no
 *  active PEs, returns the smallest possible char value. */
char max_char(poly char value);

/** Returns the largest component of a poly short.  If there are no
 *  active PEs, returns the smallest possible short value. */
short max_short(poly short value);

/** Returns the largest component of a poly int.  If there are no
 *  active PEs, returns the smallest possible int value. */
int max_int(poly int value);

/** Returns the largest component of a poly long.  If there are no
 *  active PEs, returns the smallest possible long value. */
long max_long(poly long value);

/** Returns the largest component of a poly unsigned char.  If there
 *  are no active PEs, returns the smallest possible unsigned char
 *  value. */
unsigned char max_unsigned_char(poly unsigned char value);

/** Returns the largest component of a poly unsigned short.  If there
 *  are no active PEs, returns the smallest possible unsigned short
 *  value. */
unsigned short max_unsigned_short(poly unsigned short value);

/** Returns the largest component of a poly unsigned int.  If there
 *  are no active PEs, returns the smallest possible unsigned int
 *  value. */
unsigned int max_unsigned_int(poly unsigned int value);

/** Returns the largest component of a poly unsigned long.  If there
 *  are no active PEs, returns the smallest possible unsigned long
 *  value. */
unsigned long max_unsigned_long(poly unsigned long value);

/** Returns the largest component of a poly float.  If there are no
 *  active PEs, returns negative infinity. */
float max_float(poly float value);

/** Returns the largest component of a poly double.  If there are no
 *  active PEs, returns negative infinity. */
double max_double(poly double value);

/** Locates the component of a poly string that sorts last
 *  lexicographically and copies it into the supplied buffer.  If
 *  there are no active PEs, copies an empty string into the buffer.
 *  Returns the length of the string copied into the buffer. */
size_t max_string(char *buffer, size_t buffer_len,
                  poly const char *value);

/** Returns the smallest component of a poly char.  If there are no
 *  active PEs, returns the largest possible char value. */
char min_char(poly char value);

/** Returns the smallest component of a poly short.  If there are no
 *  active PEs, returns the largest possible short value. */
short min_short(poly short value);

/** Returns the smallest component of a poly int.  If there are no
 *  active PEs, returns the largest possible int value. */
int min_int(poly int value);

/** Returns the smallest component of a poly long.  If there are no
 *  active PEs, returns the largest possible long value. */
long min_long(poly long value);

/** Returns the smallest component of a poly unsigned char.  If there
 *  are no active PEs, returns the largest possible unsigned char
 *  value. */
unsigned char min_unsigned_char(poly unsigned char value);

/** Returns the smallest component of a poly unsigned short.  If there
 *  are no active PEs, returns the largest possible unsigned short
 *  value. */
unsigned short min_unsigned_short(poly unsigned short value);

/** Returns the smallest component of a poly unsigned int.  If there
 *  are no active PEs, returns the largest possible unsigned int
 *  value. */
unsigned int min_unsigned_int(poly unsigned int value);

/** Returns the smallest component of a poly unsigned long.  If there
 *  are no active PEs, returns the largest possible unsigned long
 *  value. */
unsigned long min_unsigned_long(poly unsigned long value);

/** Returns the smallest component of a poly float.  If there are no
 *  active PEs, returns positive infinity. */
float min_float(poly float value);

/** Returns the smallest component of a poly double.  If there are no
 *  active PEs, returns positive infinity. */
double min_double(poly double value);

/** Locates the component of a poly string that sorts first
 *  lexicographically and copies it into the supplied buffer.  If
 *  there are no active PEs, copies an empty string into the buffer.
 *  Returns the length of the string copied into the buffer. */
size_t min_string(char *buffer, size_t buffer_len,
                  poly const char *value);

/** Returns a poly bool that is nonzero for at most one PE and zero
 *  for all other PEs. */
poly bool select_one(void);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest char and zero for all others. */
poly bool select_max_char(poly char value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest short and zero for all others. */
poly bool select_max_short(poly short value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest int and zero for all others. */
poly bool select_max_int(poly int value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest long and zero for all others. */
poly bool select_max_long(poly long value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest unsigned char and zero for all others. */
poly bool select_max_unsigned_char(poly unsigned char value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest unsigned short and zero for all others. */
poly bool select_max_unsigned_short(poly unsigned short value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest unsigned int and zero for all others. */
poly bool select_max_unsigned_int(poly unsigned int value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest unsigned long and zero for all others. */
poly bool select_max_unsigned_long(poly unsigned long value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest float and zero for all others.  The tolerance parameter
 *  specifies how close a PE's value must be to the largest for that
 *  PE to be selected. */
poly bool select_max_float(poly float value, float tolerance);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  largest double and zero for all others.  The tolerance parameter
 *  specifies how close a PE's value must be to the largest for that
 *  PE to be selected. */
poly bool select_max_double(poly double value, double tolerance);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  string which sorts last lexicographically. */
poly bool select_max_string(poly const char *value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest char and zero for all others. */
poly bool select_min_char(poly char value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest short and zero for all others. */
poly bool select_min_short(poly short value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest int and zero for all others. */
poly bool select_min_int(poly int value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest long and zero for all others. */
poly bool select_min_long(poly long value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest unsigned char and zero for all others. */
poly bool select_min_unsigned_char(poly unsigned char value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest unsigned short and zero for all others. */
poly bool select_min_unsigned_short(poly unsigned short value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest unsigned int and zero for all others. */
poly bool select_min_unsigned_int(poly unsigned int value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest unsigned long and zero for all others. */
poly bool select_min_unsigned_long(poly unsigned long value);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest float and zero for all others.  The tolerance parameter
 *  specifies how close a PE's value must be to the smallest for that
 *  PE to be selected. */
poly bool select_min_float(poly float value, float tolerance);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  smallest double and zero for all others.  The tolerance parameter
 *  specifies how close a PE's value must be to the smallest for that
 *  PE to be selected. */
poly bool select_min_double(poly double value, double tolerance);

/** Returns a poly bool that is nonzero for PEs that contain the
 *  string which sorts first lexicographically. */
poly bool select_min_string(poly const char *value);

#endif

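The header above is used throughout the SWAMP+ Cn code in a recurring pattern: a select_* function masks execution down to the responder PEs, select_one() breaks any ties, and a get_* function moves the winning value into the mono (control-unit) domain, mirroring the maximum-score lookup performed before the traceback. The fragment below is a minimal usage sketch under the assumption of the ClearSpeed Cn toolchain and the declarations above; the variable names and the placeholder initialization are illustrative only and are not part of the library.

#include <stdio.h>
#include "asc.h"

int main(void)
{
    poly short score = 0;   /* per-PE score; assumed to be filled in by an
                               earlier parallel phase (placeholder here)   */
    poly short row   = 0;   /* per-PE bookkeeping value to retrieve        */
    short best, best_row = -1;

    /* Mono maximum over all active PEs. */
    best = max_short(score);

    /* Mask down to the PEs holding that maximum (the responders). */
    if (select_max_short(score))
    {
        /* Keep exactly one responder, then move its data to the mono side. */
        if (select_one())
            best_row = get_short(row);
    }

    printf("best score %d found at responder row %d\n", best, best_row);
    return 0;
}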