Core Processors
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF CALIFORNIA Los Angeles Parallel Algorithms for Medical Informatics on Data-Parallel Many-Core Processors A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science by Maryam Moazeni 2013 © Copyright by Maryam Moazeni 2013 ABSTRACT OF THE DISSERTATION Parallel Algorithms for Medical Informatics on Data-Parallel Many-Core Processors by Maryam Moazeni Doctor of Philosophy in Computer Science University of California, Los Angeles, 2013 Professor Majid Sarrafzadeh, Chair The extensive use of medical monitoring devices has resulted in the generation of tremendous amounts of data. Storage, retrieval, and analysis of such data require platforms that can scale with data growth and adapt to the various behavior of the analysis and processing algorithms. In recent years, many-core processors and more specifically many-core Graphical Processing Units (GPUs) have become one of the most promising platforms for high performance processing of data, due to the massive parallel processing power they offer. However, many of the algorithms and data structures used in medical and bioinformatics systems do not follow a data-parallel programming paradigm, and hence cannot fully benefit from the parallel processing power of ii data-parallel many-core architectures. In this dissertation, we present three techniques to adapt several non-data parallel applications in different dwarfs to modern many-core GPUs. First, we present a load balancing technique to maximize parallelism in non-serial polyadic Dynamic Programming (DP), which is a family of dynamic programming algorithms with more non-uniform data access pattern. We show that a bottom-up approach to solving the DP problem exploits more parallelism and therefore yields higher performance. We achieve 228X speedup over an equivalent CPU implementation. Second, we introduce a parallel hash table as a parallel-friendly lock-free dynamic hash table. The parallel hash table structure reduces the contention on the shared objects in lock-free hash table and achieves significant throughput on many-core processor architectures. To reduce the contention, it creates multiple instances of a hash table and uses a table assignment function to distribute hash table operations to different hash table instances and guarantees key uniqueness. We achieved roughly 27X speedup over counter-part multi-thread lock-free hash table on CPU. Third, we present a memory optimization technique for the software-managed scratchpad memory based on G80, GT200, and Fermi architectures to alleviate the constraints of using scratchpad memory. We propose a memory optimization scheme that minimizes the usage of memory space by discovering the chances of memory reuse with the goal of maximizing application performance. Our solution is based on graph coloring. Our evaluations show that using this technique can reduce the execution time of applications on GPUs by up to 22% over the non-optimized GPU implementation. In addition, by leveraging massive parallelism of GPUs, we introduce a novel time-series iii searching technique for multi-dimensional time series. Searching for time series is an intuitive and practical approach to study similarity of patterns, events, and activities in patient histories. However, its computational intensity has traditionally been a constraint in the development of a complex algorithm that can handle patterns in multi-dimensional signals considering noise, scaling, and time correlation between dimensions. Using GPUs, we are able to achieve high speed up in processing signals, while improving the quality of the search algorithm and tackle problems such as noise and scaling. We used data collected from two medical monitoring devices, a Personal Activity Monitor (PAM) and Medical Shoe to evaluate our approach and show that our technique results in up to 25X speed up and up to 15 point improvement in Normalized Discounted Cumulative Gain (NDCG) for such application. iv The dissertation of Maryam Moazeni is approved. Glen Reinman Jens Palsberg Alex Bui Majid Sarrafzadeh, Committee Chair University of California, Los Angeles 2013 v To my parents … vi TABLE OF CONTENTS CHAPTER 1 ---------------------------------------------------------------------------------------------- 1 Introduction ----------------------------------------------------------------------------------------------- 1 1.1 Healthcare Application Demand for High-Performance Computing ................................ 1 1.2 General-Purpose Computing on Graphics Processors ....................................................... 3 1.3 Leveraging GPU for Healthcare ........................................................................................ 4 1.4 Dissertation Contribution ................................................................................................... 5 1.5 Dissertation Organization .................................................................................................. 7 CHAPTER 2 ---------------------------------------------------------------------------------------------- 8 Background ----------------------------------------------------------------------------------------------- 8 2.1 Parallel Computational Models.......................................................................................... 8 2.1.1 PRAM Model ............................................................................................................ 8 2.1.2 BSP Model ................................................................................................................ 9 2.1.3 SIMD vs. MIMD ...................................................................................................... 9 2.2 GPU Architecture Highlights ........................................................................................... 10 2.2.1 Architectural Features ............................................................................................. 11 2.2.2 Threading Model ..................................................................................................... 12 2.3 Programming Dwarfs ....................................................................................................... 13 2.4 Medical Applications and Platforms ................................................................................ 15 2.4.1 Diffusion Tensor Imaging Denoising ..................................................................... 15 2.4.2 Medical Shoe .......................................................................................................... 15 2.4.3 Personal Activity Monitor....................................................................................... 16 CHAPTER 3 --------------------------------------------------------------------------------------------- 18 Dynamic Programming on Data-Parallel Many-core Architectures ----------------------------- 18 3.1 Overview .......................................................................................................................... 18 3.2 Related Work ................................................................................................................... 20 3.3 Problem Formulation ....................................................................................................... 21 3.3.1 Parallelism............................................................................................................... 22 3.4 Non-Serial Polyadic DP on GPU ..................................................................................... 24 3.5 Decomposition Algorithm ............................................................................................... 26 vii 3.6 Performance Analysis ...................................................................................................... 30 3.6.1 Decomposition ........................................................................................................ 31 3.6.2 Scaling..................................................................................................................... 35 3.7 Conclusion ....................................................................................................................... 36 CHAPTER 4 --------------------------------------------------------------------------------------------- 37 Lock-Free Parallel-Friendly Hash Table on a Data-Parallel Many-Core Processor ----------- 37 4.1 Overview .......................................................................................................................... 37 4.2 Related Work ................................................................................................................... 41 4.3 CAS-based Lock-free Algorithm ..................................................................................... 42 4.4 Implementation on Data-Parallel Many-Core Architectures ........................................... 45 4.4.1 Limitations on Data-Parallel Many-Core Architectures ......................................... 47 4.4.2 Basic GPU Implementation .................................................................................... 48 4.5 Parallel-Friendly Lock-Free Hash Table ......................................................................... 49 4.5.1 Introducing Parallel Hash Table ............................................................................. 50 4.5.2 Proof of Correctness ............................................................................................... 54 4.5.3 Table