Exploring Performance Portability for Accelerators via High-level Parallel Patterns

Kaixi Hou

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science and Application

Wu-chun Feng, Chair
Calvin J. Ribbens
Hao Wang
Yong Cao
Gagan Agrawal

May 2, 2018 Blacksburg, Virginia

Keywords: Parallel Patterns, Vector Processing, SIMD, Parallel Accelerators, Multi-core, Many-core, DSL, Algorithmic Skeleton

Copyright 2018, Kaixi Hou

Kaixi Hou

(ABSTRACT)

Parallel computing dates back to the 1950s-1960s. It is only in recent years, as various parallel accelerators have become prominent and ubiquitous, that the pace of parallel computing has sped up and has had a strong impact on the design of software and hardware. The essence of this trend can be attributed to the physical limits on further increasing the operating frequency of processors and the resulting shift in focus toward integrating more computing units on one chip. Driven by this trend, many commercial parallel accelerators have become commonplace in computing systems, including multi-core CPUs, many-core GPUs (Graphics Processing Units), and the Intel Xeon Phi. Compared to a traditional single-core CPU, the performance gains of parallel accelerators can be as high as many orders of magnitude, attracting extensive interest from many scientific domains. Nevertheless, this shift presents two main difficulties for domain scientists and even expert developers: (1) To fully utilize the underlying computing resources, users have to redesign their applications using low-level languages (e.g., CUDA, OpenCL) or idiosyncratic intrinsics (e.g., AVX vector instructions). Only after a careful redesign can efficient kernels be written. (2) These kernels are oftentimes specialized, which in turn implies that, as far as performance is concerned, the codes might not be portable to a new architecture, leading to more tedious and error-prone redesign. Parallel patterns, therefore, can be a promising solution for lifting these burdens from programmers. The idea of parallel patterns is to build a bridge between target algorithms and parallel architectures. There has already been some research on using high-level languages as parallel patterns for programming modern accelerators, such as ISPC, Lift, etc. However, the existing approaches fall short of efficiently taking advantage of both the parallelism of algorithms and the diversified features of accelerators. For example, even with such high-level languages, it is still challenging to describe the patterns of the convoluted data-reordering operations in sorting networks, which are oftentimes treated as building blocks of parallel kernels. On the other hand, compilers or frameworks lack the capability and flexibility to quickly and efficiently re-target the parallel patterns to a different architecture (e.g., what types of vector instructions to use, how to efficiently organize the instructions, etc.). To overcome the aforementioned limitations, we propose a general approach that creates an effective, abstracted layer to ease the generation of efficient parallel codes for given algorithms and architectures. From algorithms to parallel patterns, we exploit domain expertise to analyze the computational and communication patterns in the core computations and represent them in DSLs (Domain Specific Languages) or algorithmic skeletons. This preserves the essential information, such as data dependencies, types, etc., for subsequent parallelization and optimization. From parallel patterns to actual codes, we use a series of automation frameworks and transformations to determine which levels of parallelism can be used, what the optimal instruction sequences are, how the implementation changes to match different architectures, etc.
In this dissertation, we present our approaches by investigating several important computational kernels, including sort (and segmented sort), sequence alignment, stencils, etc., across parallel platforms (CPUs, GPUs, Intel Xeon Phi). Through these studies, we show: (1) Our automation frameworks use DSLs or algorithmic skeletons to express parallelism. (2) The generated parallel programs take into consideration both the parallelism of the target problems and the characteristics of the underlying hardware. (3) Evaluations discuss the portable performance and the speedups over state-of-the-art optimized codes. In addition, for some problems, we also propose novel adaptive algorithms that accommodate the parallel patterns to varying input scenarios.

Kaixi Hou

(GENERAL AUDIENCE ABSTRACT)

Nowadays, parallel accelerators have become prominent and ubiquitous, e.g., multi-core CPUs, many-core GPUs (Graphics Processing Units), and the Intel Xeon Phi. The performance gains from them can be as high as many orders of magnitude, attracting extensive interest from many scientific domains. However, the gains are closely followed by two main problems: (1) A complete redesign of existing codes might be required if a new parallel platform is used, leading to a nightmare for developers. (2) Parallel codes that execute efficiently on one platform might be either inefficient or even non-executable on another platform, causing portability issues. To handle these problems, in this dissertation, we propose a general approach using parallel patterns, an effective and abstracted layer that eases the generation of efficient parallel codes for given algorithms across architectures. From algorithms to parallel patterns, we exploit domain expertise to analyze the computational and communication patterns in the core computations and represent them in DSLs (Domain Specific Languages) or algorithmic skeletons. This preserves the essential information, such as data dependencies, types, etc., for subsequent parallelization and optimization. From parallel patterns to actual codes, we use a series of automation frameworks and transformations to determine which levels of parallelism can be used, what the optimal instruction sequences are, how the implementation changes to match different architectures, etc. We demonstrate our approaches by investigating several important computational kernels, including sort (and segmented sort), sequence alignment, stencils, etc., and our experiments show their effectiveness across various parallel platforms (CPUs, GPUs, Intel Xeon Phi).

ACKNOWLEDGEMENTS

First of all, I would like to thank my advisor, Prof. Wu-chun Feng, who has provided me with invaluable knowledge and insights on research and communication skills, and who introduced me to the world of academia. I have learned so much from Dr. Feng: how to read and write research papers, how to express my ideas concisely and deliver a good presentation, how to establish productive collaborations with others, etc. Without Wu's guidance and support, I would never have reached where I am now. Next, I would like to thank my other committee members: Prof. Calvin J. Ribbens, Dr. Hao Wang, Dr. Yong Cao, and Prof. Gagan Agrawal. I appreciate all their insightful comments and feedback on my dissertation, as well as the precious time they put into serving at various stages of my Ph.D. journey. Special thanks go to Dr. Hao Wang for working closely with me on many research projects and providing me with great suggestions and helpful advice on strengthening my research capability. I have been fortunate to work with many researchers from other research institutes. I am grateful to Shuai Che (AMD Research) for his guidance on the GPU project to accelerate graph algorithms. My thanks also go to Jonathan Gallmeier (AMD Research) for the technical discussions on how to efficiently utilize different GPU platforms. I am indebted to Seyong Lee, Jeffrey Vetter, and Mehmet Belviranli (Oak Ridge National Lab) for the extended brainstorming discussions on GPU-based wavefront problems. I would also like to thank Weifeng Liu (University of Copenhagen) for sharing his knowledge and experience on GPU computing. I also want to express my gratitude to the wonderful people around me throughout these years. I am so fortunate to have had Jing Zhang, Xiaodong Yu, Xuewen Cui, and Da Zhang as my lab colleagues during my Ph.D. I cannot forget the many "pizza" nights that we spent on submission deadlines. Special thanks go to my roommates, Fang Liu, Yao Zhang, and the Jinxing Wang/Libei Huang couple, at different stages of my Ph.D. I would like to thank the following people who made my life colorful and entertaining in this "isolated" Blacksburg: the Bo Li/Yao Wang couple, Hao Zhang, Yue Cheng, Sichao Wu, Liangzhe Chen, Qianzhou Du, Xiaokui Shu, Ji Wang, Run Yu, and Mengsu Chen. My best wishes to all of you in your future endeavors. Finally, I am immensely grateful to my family back in China: my parents, Jianjun Hou and

Lu Zhao. Without their support, understanding, and unconditional love, I could not have gone through my ups and downs and finished my Ph.D. A last big thank you to my wife, Lingjie Zhang, for her love and support, for bringing enthusiasm into my life, and for being patient and understanding.

Funding Acknowledgment My research was supported by the National Science Foundation (NSF-BIGDATA IIS-1247693, NSF-XPS CCF-1337131), the Air Force Office of Scientific Research (AFOSR) Basic Research Initiative (Grant No. FA9550-12-1-0442), the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 752321, Advanced Research Computing (ARC) and the Computer Science (CS) Department at Virginia Tech, Advanced Micro Devices (AMD), and Oak Ridge National Lab (ORNL). Many thanks to all for making this dissertation possible.

Declaration of Collaboration In addition to my advisor Wu-chun Feng, this dissertation has benefited from many collaborators, especially:

• Hao Wang contributed to the work included in Chapter 3, Chapter 4, Chapter 5, Chapter 6, and Chapter 7 of the dissertation.

• Weifeng Liu (Norwegian University of Science and Technology) contributed to both design and evaluation parts of the work in Chapter 4.

• Seyong Lee and Jeffrey Vetter (ORNL) contributed to the work in Chapter 6.

• Jonathan Gallmeier and Shuai Che (AMD) contributed to the overall idea of the work in Chapter 7.

Contents

1 Introduction
  1.1 Motivation
  1.2 Performance Portability via DSLs
    1.2.1 Data-reordering in Vectorized Sort
    1.2.2 Data-thread Binding in Parallel Segmented Sort
  1.3 Performance Portability via Algorithmic Skeletons
    1.3.1 SIMD Operations in Sequence Alignment
    1.3.2 Data Dependencies in Wavefront Loops
    1.3.3 Data Reuse in Stencil Computations
  1.4 Research Contribution
  1.5 Dissertation Organization

2 Background and Related Work
  2.1 Vector Processing in Modern Parallel Platforms
    2.1.1 Vector Processing in x86-based Systems
    2.1.2 Vector Processing in GPUs
  2.2 Parallel Sort and Segmented Sort
  2.3 Parallel Sequence Alignment and Wavefront Computation
  2.4 Parallel Stencil Computation
  2.5 Previous Approaches for Performance Portability
  2.6 Broader Performance Portability Issues

3 Data-reordering in Vectorized Sort
  3.1 Introduction
  3.2 Terminology
    3.2.1 DSL for Data-Reordering Operations
    3.2.2 Sorting and Merging Network
  3.3 Framework and Generalized Patterns
    3.3.1 SIMD Sorter
    3.3.2 SIMD Transposer
    3.3.3 SIMD Merger
  3.4 Code Generation and Optimization
    3.4.1 SIMD Code Generator
    3.4.2 Organization of the ASPaS Kernels
    3.4.3 Thread-level Parallelism
    3.4.4 Sorting of {key,data} Pairs
  3.5 Performance Analysis
    3.5.1 Performance of Different Sorting Networks
    3.5.2 Speedups from the ASPaS Framework
    3.5.3 Comparison to Previous SIMD Kernels
    3.5.4 Comparison to Sorting from Libraries
    3.5.5 Sorting Different Input Patterns
  3.6 Chapter Summary

4 Data-thread Binding in Parallel Segmented Sort
  4.1 Introduction
  4.2 Motivation
    4.2.1 Segmented Sort
    4.2.2 Skewed Segment Length Distribution
    4.2.3 Sorting Networks and Their Limitations
  4.3 Methodology
    4.3.1 Adaptive GPU SegSort Mechanism
    4.3.2 Reg-sort: Register-based Sort
    4.3.3 Smem-merge: Shared Memory-based Merge
    4.3.4 Other Optimizations
  4.4 Performance Results
    4.4.1 Kernel Performance
    4.4.2 Segmented Sort Performance
  4.5 SegSort in Real-World Applications
    4.5.1 Suffix Array Construction
    4.5.2 Sparse Matrix-Matrix Multiplication
  4.6 Chapter Summary

5 SIMD Operations in Sequence Alignment
  5.1 Introduction
  5.2 Motivation and Challenges
    5.2.1 Pairwise Sequence Alignment Algorithms
    5.2.2 Challenges
  5.3 Generalized Pairwise Alignment Paradigm
  5.4 AAlign Framework
    5.4.1 Vector Code Constructs
    5.4.2 Hybrid Method
    5.4.3 Vector Modules
    5.4.4 Code Translation
    5.4.5 Multi-threaded Version
  5.5 Evaluation
    5.5.1 Speedups from Our Framework
    5.5.2 Performance for Pairwise Alignment
    5.5.3 Performance for Multi-threaded Codes
  5.6 Chapter Summary

6 Data Dependencies in Wavefront Loops
  6.1 Introduction
  6.2 Motivation
    6.2.1 Wavefront Loops and Direct Parallelism
    6.2.2 Tiling-based Solutions and Their Limitations
    6.2.3 Compensation-based Solutions and Their Limitations
  6.3 Compensation-based Computation – Theory
    6.3.1 Standard Wavefront Computation Pattern
    6.3.2 Compensation-based Computation Pattern
  6.4 Compensation-based Computation – Practice
  6.5 Design and Implementation on GPUs
    6.5.1 Compensation-based Computation on GPUs
    6.5.2 Synchronizations on GPUs: Global vs. P2P
    6.5.3 Putting Them All Together
    6.5.4 Library-based Implementations
  6.6 Evaluation
    6.6.1 Performance of Compensation-based Kernels
    6.6.2 Performance of Hybrid Kernels
    6.6.3 Discussion
  6.7 Chapter Summary

7 Data Reuse in Stencil Computations
  7.1 Introduction
  7.2 Motivation and Challenges
    7.2.1 Stencil Computation
    7.2.2 Spatial Blocking Schemes
    7.2.3 Challenges
  7.3 GPU-UniCache Framework
    7.3.1 GPU-UniCache API
    7.3.2 GPU-UniCache Example
  7.4 Code Generation
    7.4.1 Input Parameters
    7.4.2 RegCache Methods
    7.4.3 LDSCache Methods
  7.5 Evaluation
    7.5.1 Experiment Setup
    7.5.2 AMD GCN3 GPU
    7.5.3 NVIDIA Maxwell GPU
    7.5.4 Speedups to Existing Benchmarks
    7.5.5 Discussion
  7.6 Chapter Summary

8 Conclusion and Future Work
  8.1 Summary
  8.2 Future Directions
    8.2.1 Utilizing Heterogeneous Accelerators and Clusters
    8.2.2 Accelerating Big Data Applications for Parallel Platforms

Bibliography

List of Figures

1.1 Serial and vector codes to interleave two given input arrays
1.2 Exploring the parallelism and performance-portability via parallel patterns

2.1 Reordering data on Intel CPUs and MICs
2.2 Data permute/shuffle in AMD and NVIDIA GPUs

3.1 Bitonic networks
3.2 The structure of ASPaS and the generated sort
3.3 Mechanism of the sort stage: operations generated by SIMD Sorter and SIMD Transposer
3.4 Four 4-element vectors go through the 4-key …
3.5 Four 4-element vectors transpose with the formalized permutation operators of DSL
3.6 Two formalized variants of bitonic merging networks: the inconsistent pattern and the consistent pattern
3.7 Permute matrix representations and the pairing rules
3.8 Symmetric blend operation and its pairing details
3.9 The organization of ASPaS kernels for the single-threaded aspas::sort
3.10 Performance comparison of aspas sort and aspas merge with different sorting and merging networks
3.11 ASPaS vs. icpc optimized ("compiler-vec") and serial ("no-vec") codes
3.12 ASPaS kernels vs. previous manual approaches
3.13 aspas::sort vs. mergesorts and aspas::parallel sort vs. the Intel TBB parallel sort
3.14 aspas::sort vs. library sorting tools
3.15 Performance of ASPaS sorting different input patterns

4.1 An example of segmented sort
4.2 Histogram of segment length changes in SpGEMM and SAC
4.3 One sorting network and existing strategies using registers on GPUs
4.4 Overview of our GPU Segsort design
4.5 Primitive communication pattern and its implementation
4.6 Generic exch_intxn, exch_paral and exch_local patterns, and a code example of 3
4.7 An example of warp-based merge using shared memory
4.8 An example of in-register transpose
4.9 Performance of reg-sort routines with different combinations of data-thread binding policies, write methods, and block sizes
4.10 Performance of smem-merge routines with different combinations of data-thread binding policies and write methods
4.11 Performance of different segsort over segments with uniform distribution
4.12 Segmented sort vs. existing tools over segments of power-law distribution
4.13 Performance of suffix array construction using our segmented sort
4.14 Performance of SpGEMM using our segmented sort

5.1 Data dependencies in the alignment algorithms using dynamic programming
5.2 Example of comparing two vectorizing strategies under various conditions on MIC
5.3 High-level overview of the structure of the AAlign framework
5.4 The original and SIMD-friendly striped layouts
5.5 The mechanism of the hybrid method
5.6 Vector modules used in the striped-iterate
5.7 Example of chosen ISA intrinsics for rshift_x_fill
5.8 Orchestration mechanism in the wgt_max_scan
5.9 AAlign codes vs. baseline sequential codes
5.10 AAlign codes using striped-iterate, striped-scan, and hybrid method
5.11 AAlign Smith-Waterman w/ affine gap vs. existing highly-optimized tools

6.1 Parallelization landscape for wavefront loops
6.2 Exposed parallelism and corresponding memory access pattern of the two forms of loop nests in Algorithm 6.1
6.3 Splitting the array A into hyperplane tiles and their access patterns w/ and w/o padding
6.4 Compensation-based solutions decompose the processing into three steps, each of which can be parallelism-friendly and load balanced
6.5 Parallel design of the weighted scan-based compensation computation
6.6 Performance comparison between the row-major computation with global sync. and the anti-diagonal-major computation with peer-to-peer (p2p) sync.
6.7 Proposed hybrid method to adapt the computation and synchronization to different wavefront problems and workspace matrices
6.8 Throughput comparison of the weighted scan kernels
6.9 Performance of our hybrid method with varying tile sizes (height * width)
6.10 Performance comparison of the library-based (lib-thrust, lib-mgpu), tiling-based (tile, hypertile) and our hybrid method on different input matrices

7.1 Blocking schemes for 2D and 3D stencils
7.2 Diversified performance of stencils under different situations
7.3 An overview of the GPU-UniCache framework
7.4 Example of data exchange for "2D9Pt" stencil
7.5 1D stencils with HCC by GPU-UniCache on AMD GPU
7.6 2D stencils with HCC by GPU-UniCache on AMD GPU
7.7 3D stencils with HCC by GPU-UniCache on AMD GPU
7.8 1D stencils with CUDA by GPU-UniCache on NVIDIA GPU
7.9 2D stencils with CUDA by GPU-UniCache on NVIDIA GPU
7.10 3D stencils with CUDA by GPU-UniCache on NVIDIA GPU
7.11 GPU-UniCache optimized codes vs. existing stencil benchmarks optimized by spatial blocking on NVIDIA Maxwell GPU

List of Tables

3.1 Primitive Types
3.2 The building modules to handle the data-reordering for {key,data} pairs in ASPaS
3.3 Testbeds for ASPaS

4.1 Experiment Testbeds

5.1 The vector modules in AAlign
5.2 Configurable expressions in vector code constructs

6.1 Experiment Testbeds

7.1 Summary of the stencil computations
7.2 GPU-UniCache and its subclass member functions
7.3 List of input parameters for the code generation
7.4 Experiment Testbeds
7.5 Register usage of jacobi-3d stencil

Chapter 1

Introduction

In this chapter, we motivate our work on exploring parallel patterns for performance portability, outline the important challenges encountered by current studies, and present the contributions of this dissertation.

1.1 Motivation

Over the past decade, Moore's-Law-style performance doubling every two years has come not from increasing the clock speed of a single processor but from integrating many efficient cores on one chip. Since we have hit the power limit, ever-increasing performance no longer comes from strenuously pushing a single core faster; instead, it comes from a style of computing known as parallel computing [18]. Perhaps nowhere is this more evident than in the ubiquitous commercial parallel processors all around us, from mobile processors in our smart phones, to multi-cores in laptops, to gaming GPUs in desktops, to various accelerators (e.g., GPUs, Intel Xeon Phi, etc.) in computing servers. The evolution of parallel computing has sped up and become prominent in the high-performance computing community and many scientific domains. Parallel computing power that is many orders of magnitude higher than that of a single-core CPU attracts more scientists to port their legacy software to parallel versions. This trend initiates the shift from the traditional sequential programming model towards a variety of parallel models. Unfortunately,

the exploitation of these parallel models can be accompanied by many challenges, which oftentimes present extreme difficulties, even for utilizing the most common multi-core CPUs and even for expert programmers. These obstacles have two main aspects: (1) Designing efficient kernels demands great attention and knowledge from programmers in low-level languages, including CUDA or OpenCL for parallelization on GPU platforms, and SSE/AVX intrinsics for vectorization on x86 platforms. It therefore behooves programmers to understand the details of the underlying hardware. (2) The carefully designed codes might be dedicated to one platform, causing performance-portability issues: the same kernels would fail to achieve high performance when platforms change. If this happens, a tedious redesign is inevitable. Therefore, in practice, the enthusiasm for using accelerators is plagued by the costly trade-off between programming effort and the resultant performance.

[Figure 1.1 content: (a) a serial loop that interleaves inpA and inpB into trgA and trgB; (b) the equivalent platform-specific vector codes, built from _mm256_unpacklo_ps/_mm256_unpackhi_ps and _mm256_permute2f128_ps on AVX platforms, and from _mm512_permute4f128_epi32, _mm512_mask_swizzle_epi32, and _mm512_shuffle_epi32 on IMCI platforms.]

Figure 1.1: Serial and vector codes to interleave two given input arrays. *: In vector codes, the vector load and store instructions are ignored for brevity.

The most straightforward solution for these challenges might be to simply rely on guidance directives (e.g., OpenMP) and compiler options (e.g., -O3) to automatically parallelize the loops or functions that could be executed in parallel [74]. However, compilers are oftentimes too conservative to parallelize target codes, due to the lack of accurate compiler analysis and effective compiler transformations [108]. Figure 1.1a presents, for instance, the serial codes to interleave two given input arrays; certainly, the codes are parallelizable. Nevertheless, modern compilers (e.g., ICC, GCC, and PGCC) cannot effectively parallelize these codes, even if different directives and compiler options are used. As a result, programmers have to tweak, transform, or even rewrite the codes to fulfill the architectural properties [135, 19, 74, 176]. Figure 1.1b shows the platform-specific vector codes for the same purpose with the AVX and IMCI instruction sets, respectively. Apparently, both the instructions and their associated parameters are idiosyncratic, causing not only a programming burden but also portability issues.

[Figure 1.2 content: target problems (sort, segmented sort, alignment, stencils, ...) are abstracted into parallel patterns (domain specific languages and algorithmic skeletons) and then translated into platform-specific codes for CPUs, GPUs, and Xeon Phis.]

Figure 1.2: Exploring the parallelism and performance-portability via parallel patterns.

In this dissertation, we focus on an alternative solution that takes advantage of parallel patterns, i.e., domain specific languages (DSLs) and algorithmic skeletons [50]. In particular, as shown in Figure 1.2, our approaches start from the abstraction of target problems into parallel patterns and end with the translation from parallel patterns to real codes. On abstraction, we exploit domain expertise to analyze the computational and communication patterns in the problems and express them in DSLs or algorithmic skeletons. In this way, we can preserve the important domain-specific information and create an intermediate layer that facilitates the subsequent generation of parallel codes. On translation, we transform the patterns to efficient parallel codes by considering the characteristics of the underlying architectures. We propose, design, and implement a series of automation frameworks and optimizations to determine how different levels of parallelism can be applied, what the optimal instruction sequences are, how to tune the generated kernels for the best performance, etc. The overarching goal of our approaches is to achieve performance portability across different accelerators without the hassle of low-level parallel programming. This dissertation selects five important scientific kernels (i.e., sort, segmented sort, alignment, wavefront, and stencil) as the target research problems and uses three widely-used parallel platforms (i.e., multicore CPUs, manycore Intel Xeon Phi, and GPUs). In the next sections, we briefly describe these research problems and the proposed research methodology using DSLs and algorithmic skeletons, respectively.

1.2 Performance Portability via DSLs

1.2.1 Data-reordering in Vectorized Sort

Due to the difficulty that modern compilers have in vectorizing applications on vector-extension architectures, programmers resort to manually programming vector registers with intrinsics in order to achieve better performance. However, the continued growth in the width of registers and the evolving library of intrinsics make such manual optimizations tedious and error-prone. That is, the different architectures, varying vector functionalities, and idiosyncratic instructions and parameters present a great challenge, from the programmer's perspective, for achieving high performance across platforms. In the first part of this dissertation, we propose a framework for the Automatic SIMDization of Parallel Sorting (ASPaS) on x86-based multicore and manycore processors. ASPaS takes any sorting network and a given instruction set architecture (ISA) as inputs and automatically generates vectorized code for that sorting network. By formalizing the sort function as a sequence of comparators and the transpose and merge functions as sequences of vector-matrix multiplications, ASPaS can map these functions to operations from a selected "pattern pool" that is based on the characteristics of parallel sorting, and then generate platform-specific vector codes. The performance evaluation of our ASPaS framework illustrates that the automatically generated sorting codes from ASPaS can outperform the sorting implementations from STL, Boost, and Intel TBB on both CPUs and Xeon Phis.
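To make the comparator formalization concrete, the sketch below shows how a single compare-exchange of a sorting network maps to SIMD min/max instructions when each vector lane carries an independent sequence. This is an illustrative, host-side C++ fragment written for this summary, not ASPaS output; the AVX intrinsics are standard, while the helper names (compare_exchange, sort4_columns) are hypothetical.

```cpp
#include <immintrin.h>

// Illustrative sketch (not generated by ASPaS): one comparator of a sorting
// network, vectorized so that eight independent float sequences -- one per
// AVX lane -- are compare-exchanged at once.
static inline void compare_exchange(__m256 v[], int i, int j) {
    __m256 lo = _mm256_min_ps(v[i], v[j]);  // smaller key of each lane
    __m256 hi = _mm256_max_ps(v[i], v[j]);  // larger key of each lane
    v[i] = lo;
    v[j] = hi;
}

// A 4-key sorting network applied to eight lanes in parallel: comparators
// (0,1), (2,3), (0,2), (1,3), (1,2).
static inline void sort4_columns(__m256 v[4]) {
    compare_exchange(v, 0, 1);
    compare_exchange(v, 2, 3);
    compare_exchange(v, 0, 2);
    compare_exchange(v, 1, 3);
    compare_exchange(v, 1, 2);
}
```

After such a column-wise sort, the sorted runs still sit across vectors; the in-register transpose and merge steps, whose data-reordering instructions differ across AVX, AVX2, and IMCI, are exactly the parts that the ASPaS DSL formalizes and generates.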

1.2.2 Data-thread Binding in Parallel Segmented Sort

Segmented sort, as a generalization of classical sort, orders a batch of independent segments within a whole array. Along with the wider adoption of manycore processors for HPC and big-data applications, segmented sort plays an increasingly important role relative to sort. To parallelize segmented sort on GPUs, directly sorting each segment in parallel could cause severe load imbalance. Even with "dynamic parallelism" techniques, the implementation may suffer degraded performance due to the high overhead of context switching. In the second part of this dissertation, we present an adaptive segmented sort mechanism for GPUs. Our mechanism includes two core techniques: (1) a differentiated method for different segment lengths to eliminate the irregularity caused by various workloads and thread divergence; and (2) a register-based sort method to support N-to-M data-thread binding and in-register data communication. We also implement a shared memory-based merge method to support non-uniform-length chunk merge via multiple warps. Our segmented sort mechanism shows great improvements over the methods from CUB, CUSP, and ModernGPU on NVIDIA K80-Kepler and TitanX-Pascal GPUs. Furthermore, we apply our mechanism to two applications, i.e., suffix array construction and sparse matrix-matrix multiplication, and obtain a prominent advantage over state-of-the-art implementations.
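For a flavor of what register-level sorting looks like, the CUDA sketch below sorts 32 keys per warp with a bitonic network, using __shfl_xor_sync for in-register communication. Each thread holds one key, i.e., a 1-to-1 data-thread binding; the reg-sort method summarized above generalizes this idea to N-to-M bindings. The kernel is an illustration written for this summary, not the dissertation's implementation.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch (not the reg-sort of Chapter 4): a warp-wide bitonic
// sort in which each of the 32 threads keeps one float in a register and
// exchanges partner values via __shfl_xor_sync.
__global__ void warp_bitonic_sort(float* data, int n) {
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;
    // Pad missing elements with +inf so they sink to the end of the warp.
    float key = (gid < n) ? data[gid] : __int_as_float(0x7f800000);

    for (int k = 2; k <= 32; k <<= 1) {          // bitonic sequence size
        for (int j = k >> 1; j > 0; j >>= 1) {   // compare-exchange distance
            float partner  = __shfl_xor_sync(0xffffffffu, key, j);
            bool ascending = ((lane & k) == 0);
            bool lower     = ((lane & j) == 0);
            // Keep the min if this lane is the "low" side of an ascending pair.
            key = (ascending == lower) ? fminf(key, partner) : fmaxf(key, partner);
        }
    }
    if (gid < n) data[gid] = key;                // each warp's 32 keys end up ascending
}
```

Segments longer than one warp's worth of registers are what the shared memory-based merge above is for; the point here is only the in-register compare-exchange that avoids shared- and global-memory traffic.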

1.3 Performance Portability via Algorithmic Skeletons

1.3.1 SIMD Operations in Sequence Alignment

The pairwise sequence alignment algorithms, e.g., Smith-Waterman and Needleman-Wunsch, with adjustable gap penalty systems are widely used in bioinformatics. The strong data dependencies in these algorithms prevent them from exploiting the auto-vectorization techniques in compilers. When programmers manually vectorize them on multi- and manycore processors, two vectorizing strategies are usually considered, both of which initially ignore data dependencies and then appropriately correct the results in a subsequent stage: (1) iterate, which vectorizes and then compensates

the scoring results with multiple rounds of corrections, and (2) scan, which vectorizes and then corrects the scoring results primarily via one round of parallel scan. However, manually writing such vectorized code efficiently is non-trivial, even for experts, and the code may not be portable across ISAs. In addition, even highly vectorized and optimized codes may not achieve optimal performance, because selecting the best vectorizing strategy depends on the algorithms, configurations (gap systems), and input sequences. In the third part of this dissertation, we propose AAlign, a framework to automatically vectorize pairwise sequence alignment algorithms across ISAs. AAlign ingests a sequential code (which follows our generalized paradigm for pairwise sequence alignment) and automatically generates efficient vector code for iterate and scan. To reap the benefits of both vectorization strategies, we propose a hybrid mechanism in which AAlign automatically selects the best vectorizing strategy at runtime, no matter which algorithms, configurations, and input sequences are specified. On Intel Haswell and MIC, the generated codes for Smith-Waterman and Needleman-Wunsch achieve up to a 26-fold speedup over their sequential counterparts. Compared to highly optimized and multi-threaded sequence alignment tools, e.g., SWPS3 and SWAPHI, our codes can deliver up to 2.5-fold and 1.6-fold speedups, respectively.
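For reference, the dependencies in question come from the dynamic-programming recurrences of these algorithms. One common formulation of Smith-Waterman with an affine gap penalty (gap-open penalty $G_o$, gap-extension penalty $G_e$, substitution score $s$; conventions vary slightly across tools) is

\[
\begin{aligned}
E_{i,j} &= \max\bigl(E_{i,j-1} - G_e,\; H_{i,j-1} - G_o\bigr),\\
F_{i,j} &= \max\bigl(F_{i-1,j} - G_e,\; H_{i-1,j} - G_o\bigr),\\
H_{i,j} &= \max\bigl(0,\; H_{i-1,j-1} + s(q_i, d_j),\; E_{i,j},\; F_{i,j}\bigr).
\end{aligned}
\]

When the cells of one column are vectorized (as in the striped layouts of Chapter 5), the $F_{i,j}$ term chains every cell to the one above it in the same column; roughly speaking, this is the dependency that the iterate strategy first ignores and then repairs with repeated corrections, and that the scan strategy resolves with one pass of a weighted max-scan.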

1.3.2 Data Dependencies in Wavefront Loops

Wavefront loops are widely used in many scientific applications, e.g., partial differential equation (PDE) solvers and sequence alignment tools. However, due to the data dependencies in wavefront loops, it is challenging to fully utilize the abundant compute units of GPUs and to reuse data through their memory hierarchy. Existing solutions can only optimize for these factors to a limited extent. For example, tiling-based methods optimize memory access but may result in load imbalance, while compensation-based methods, which change the original order of computation to expose more parallelism and then compensate for it, suffer from both global synchronization overhead and limited generality. In the fourth part of this dissertation, we first prove under which circumstances breaking data dependencies and properly changing the sequence of computation operators in our compensation-based method does not affect the correctness of the results. Based on this analysis, we design a highly efficient compensation-based parallelization on GPUs. Our method provides weighted scan-based GPU kernels to optimize the computation and combines them with the tiling method to optimize memory access and synchronization. The performance results on the NVIDIA K80 and P100 GPU platforms demonstrate that our method can achieve significant improvements for four types of real-world application kernels over the state-of-the-art research.
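To make the setting concrete, a typical 2D wavefront loop updates each cell from its already-computed west and north neighbors; one generic two-operator form (written here only for illustration, since the exact form depends on the application) is

\[
A_{i,j} \;=\; \bigl(A_{i,j-1} \otimes w_{\mathrm{west}}\bigr) \;\oplus\; \bigl(A_{i-1,j} \otimes w_{\mathrm{north}}\bigr) \;\oplus\; b_{i,j}.
\]

As Chapter 6 shows, when the accumulation operator $\oplus$ is associative and commutative and the distribution operator $\otimes$ either distributes over $\oplus$ or coincides with it, the in-row dependency can be temporarily dropped, each row evaluated with a weighted prefix scan, and the omitted contribution compensated afterwards without changing the final results.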

1.3.3 Data Reuse in Stencil Computations

Spatial blocking is a critical memory-access optimization to efficiently exploit the computing resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data over multiple spatial iterations, spatial blocking can significantly lessen the pressure of accessing slow global memory. Stencil computations, for example, can exploit such data reuse via spatial blocking through the memory hierarchy of the GPU to improve performance. However, approaches that take advantage of such blocking require complex and tedious changes to the GPU kernels for different stencils, GPU architectures, and multi-level cached systems. In the fifth part of this dissertation, we explore the challenges of different spatial blocking strategies over three cache levels of the GPU (i.e., L1 cache, scratchpad memory, and registers) and propose GPU-UniCache, a framework to automatically generate codes to access buffered data in the cached systems of GPUs. Based on the characteristics of spatial blocking over various stencil kernels, we generalize the patterns of data communication, index conversion, and synchronization (with abstracted ISA-friendly interfaces) and map them to different architectures with highly optimized code variants. Our approach greatly simplifies the design of efficient and portable stencil computations across GPUs. Compared to stencil kernels based on hardware-managed memory (L1 cache) and other state-of-the-art GPU benchmarks, GPU-UniCache can achieve significant improvements.
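Of the three cache levels, the register level is the least familiar, so the CUDA sketch below illustrates the kind of code GPU-UniCache automates: each thread loads one cell into a register and obtains its neighbors' values through warp shuffles, falling back to global memory only at warp boundaries. It is an example written for this summary (with made-up weights), not code generated by the framework.

```cuda
#include <cuda_runtime.h>

// Illustrative register-cache sketch (not GPU-UniCache output): a 1D 3-point
// stencil whose in-warp neighbor values come from registers via shuffles.
__global__ void stencil_1d3pt_regcache(const float* __restrict__ in,
                                       float* __restrict__ out, int n) {
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;
    float center = (i < n) ? in[i] : 0.0f;                  // one coalesced global load

    float west = __shfl_up_sync(0xffffffffu, center, 1);    // register of lane-1
    float east = __shfl_down_sync(0xffffffffu, center, 1);  // register of lane+1
    if (lane == 0  && i > 0)     west = in[i - 1];           // halo at warp boundary
    if (lane == 31 && i + 1 < n) east = in[i + 1];

    if (i > 0 && i < n - 1)
        out[i] = 0.25f * west + 0.5f * center + 0.25f * east; // illustrative weights
}
```

For 2D/3D stencils, for the scratchpad (LDS/shared memory) level, and for AMD's permute instructions, the index arithmetic and synchronization grow quickly; that bookkeeping is precisely what the framework generates automatically.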

1.4 Research Contribution

From the above two aspects, we demonstrate in this dissertation that we can achieve performance portability for various scientific computational kernels across parallel platforms by using our proposed parallel-pattern-based approaches. We study how to exploit parallel patterns to express the core computation and communication patterns in many scientific kernels. We explore the effectiveness of translating these parallel patterns to real parallel codes. We also investigate how to optimize and tune the codes by considering different inputs, workloads, and underlying parallel architectures. In the following, we highlight the specific research contributions that this dissertation makes.

A Framework for Automatic Vectorization of Sorting on CPUs and MICs In this work, we make the following contributions. First, for portability, we propose the ASPaS framework to automatically generate cross-platform parallel sorting codes using architecture-specific SIMD instructions, including AVX, AVX2, and IMCI. Second, for functionality, using ASPaS, we can generate various parallel sorting codes for the combinations of five sorting networks, two merging networks, and three datatypes (integer, float, double) on Intel Ivy Bridge and Haswell CPUs and the Intel Knights Corner MIC. In addition, ASPaS generates the vectorized codes not only for the sorting of arrays, but also for the sorting of {key,data} pairs, which is a requisite functionality for sorting real-world workloads. Third, for performance, we conduct a series of rigorous evaluations to demonstrate how the ASPaS-generated codes can yield performance benefits by efficiently using the vector units and computing cores on different hardware architectures.

Fast Segmented Sort on GPUs In this work, the contributions are as follows. First, we identify the importance of segmented sort for various applications by exploring segment-length distributions in real-world datasets and uncovering performance issues of existing tools. Second, we propose an adaptive segmented sort mechanism for GPUs, whose key techniques include: (1) a differentiated method for different segment lengths to eliminate load imbalance, thread divergence, and irregular memory access; and (2) an algorithm that

extends sorting networks to support N-to-M data-thread binding and thread communication at the GPU register level. Third, we carry out a comprehensive evaluation at both the kernel level and the application level to demonstrate the efficiency and generality of our mechanism on two NVIDIA GPU platforms.

A SIMD Framework for Pairwise Sequence Alignment on CPUs and MICs In this work, the major contributions include the following. First, we propose the AAlign framework that can automatically generate parallel codes for pairwise sequence alignment with combinations of algorithms, vectorizing strategies, and configurations. Second, we identify that the existing vectorizing strategies cannot always provide optimal performance, even when the codes are highly vectorized and optimized. As a result, we design a hybrid mechanism to take advantage of both vectorizing strategies. Third, using AAlign, we generate various parallel codes for the combinations of algorithms (SW and NW), vectorizing strategies (striped-iterate, striped-scan, and hybrid), and configurations (linear and affine gap penalty systems) on two x86-based platforms, i.e., the Advanced Vector eXtensions (AVX2)-supported multicore and the Initial Many Core Instructions (IMCI)-supported manycore.

Compensation-based Parallelism for Wavefront Loops on GPUs In this work, the key contributions are summarized below. First, we prove that in wavefront loops, if the accumulation operator is associative and commutative and the distribution operator is either distributive over or the same as the accumulation operator, then breaking the data dependency and properly changing the sequence of computation operators does not affect the correctness of the results. This provides guidance for developers on the circumstances under which the compensation-based method can be used. Second, we design a highly efficient compensation-based method on GPUs. Our method provides weighted scan-based GPU kernels to optimize the computation and combines them with the tiling method to optimize memory access and synchronization. Third, we carry out a comprehensive evaluation at both the kernel level and the application level to demonstrate the efficiency of our method over the state-of-the-art research for wavefront loops.

Automatic Code Generation of Spatial Blocking for Stencils on GPUs In this work, the contributions include the following: (1) GPU-UniCache, a framework to automatically generate spatial blocking codes for stencil kernels on GPUs, and (2) a comprehensive evaluation of the GPU-UniCache framework on AMD and NVIDIA GPUs. GPU-UniCache not only improves programming productivity by unifying the interfaces of spatial blocking for different stencils, GPU architectures, and cache levels; it also provides high performance by optimizing data distribution, index conversion, thread communication, and synchronization to facilitate data access in GPU kernels. Compared to hardware-managed memory (L1 cache), with single-precision arithmetic, our automatically generated codes deliver up to 1.7-fold and 1.8-fold speedups at the scratchpad memory level and register level, respectively, when running on an AMD GCN3 GPU, and up to 1.6-fold and 1.8-fold, respectively, when running on an NVIDIA Maxwell GPU. For double precision, they deliver up to a 1.3-fold speedup on both GPU platforms. Compared to state-of-the-art benchmarks (incl. Parboil, PolyBench, SHOC), they also provide up to a 1.5-fold improvement.

1.5 Dissertation Organization

The remaining chapters of this dissertation address the following topics. Chapter 2 presents the background information and state-of-the-art related work on the research problems posed in this dissertation. Chapters 3 and 4 present DSL-based approaches to achieve performance portability: Chapter 3 presents our proposed framework to generate high-performance sort kernels for different x86-based systems, and Chapter 4 introduces an adaptive segmented sort for modern GPUs. Chapters 5 to 7 focus on the approaches using algorithmic skeletons: Chapter 5 presents the vectorized pairwise sequence alignment kernels for high efficiency on CPUs and Intel Xeon Phis; Chapter 6 provides a new compensation-based parallel solution for general wavefront problems on GPUs; and Chapter 7 introduces a framework for stencil computations to access reusable data in the memory hierarchy of GPUs from different vendors. Finally, Chapter 8 concludes and discusses future work.

Chapter 2

Background and Related Work

In this chapter, we provide brief background information and state-of-the-art work regarding the research problems of this dissertation. We focus on applying parallel patterns to bridge the gap between complex computational codes and the parallelism available in various accelerators. This chapter first introduces the major hardware characteristics that are closely related to this dissertation. Then, we summarize the state-of-the-art research on each research problem. In addition, we compare it with our proposed approaches and frameworks based on parallel patterns from the perspectives of novelty, efficacy, etc.

2.1 Vector Processing in Modern Parallel Platforms

2.1.1 Vector Processing in x86-based Systems

The vector processing units (VPUs) equipped in modern x86-based processors are designed to perform a single operation over multiple data items simultaneously. The width of the vectors and the richness of the instruction sets are continuously expanding and evolving, forming different generations of vector ISAs, e.g., AVX, AVX2, and IMCI.

AVX/AVX2 on CPUs: AVX was initially supported by Intel's "Sandy Bridge" processors; each of their 16 256-bit registers contains two 128-bit lanes (named lanes B and A in Figure 2.1), together holding 8 floats or 4 doubles. The three-operand instructions in AVX use a non-destructive form that preserves the two source operands. AVX2 has been available since the "Haswell" processors; it expands the integer instructions in SSE and AVX and supports integers of various widths. Moreover, AVX2 increases the instructional functionality by adding, for example, gather support to load non-contiguous memory locations and per-element shift instructions. In both AVX and AVX2, the data-reordering operations include permutations within each 128-bit lane and permutations across the two lanes; the latter are considered more expensive. The top-left sub-figure in Figure 2.1 shows an example of rearranging data within the same vector. We first exchange the two lanes B and A (using permute) and then conduct the in-lane permutation by swapping the middle two elements (using permute or shuffle). The top-right sub-figure illustrates another example of using the unpacklo instruction to interleave the numbers at even positions from two input vectors. The specific instructions used in AVX and AVX2 might differ from one another. Note that AVX2 also supports integer elements of different widths, e.g., char, short, and int.

[Figure 2.1 content: the top panels show AVX/AVX2 examples of reordering data within one vector (a cross-lane _mm256_permute2f128_ps followed by an in-lane _mm256_permute_ps or _mm256_shuffle_epi32) and of interleaving data from two vectors (_mm256_unpacklo_ps / _mm256_unpacklo_epi32); the bottom panels show the IMCI counterparts on the MIC (_mm512_permute4f128_epi32 and _mm512_shuffle_epi32 for one vector, and a masked _mm512_mask_swizzle_epi32 for two vectors).]

Figure 2.1: Reordering data on Intel CPUs and MICs from the same vector register and the two vector registers, respectively.
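As a readable reference for the interleaving pattern used throughout this chapter, the fragment below reproduces the AVX intrinsic sequence of Figure 1.1b as a small, self-contained C++ function; only the function wrapper and the load/store intrinsics (which the figures omit for brevity) are added here.

```cpp
#include <immintrin.h>

// Interleave two 8-float vectors with AVX, following Figure 1.1b:
// unpack within the 128-bit lanes, then fix up the lane order.
void interleave8(const float* inpA, const float* inpB, float* trgA, float* trgB) {
    __m256 va = _mm256_loadu_ps(inpA);                 // a0 a1 a2 a3 | a4 a5 a6 a7
    __m256 vb = _mm256_loadu_ps(inpB);                 // b0 b1 b2 b3 | b4 b5 b6 b7
    __m256 v1 = _mm256_unpacklo_ps(va, vb);            // a0 b0 a1 b1 | a4 b4 a5 b5
    __m256 v2 = _mm256_unpackhi_ps(va, vb);            // a2 b2 a3 b3 | a6 b6 a7 b7
    __m256 lo = _mm256_permute2f128_ps(v1, v2, 0x20);  // a0 b0 a1 b1 a2 b2 a3 b3
    __m256 hi = _mm256_permute2f128_ps(v1, v2, 0x31);  // a4 b4 a5 b5 a6 b6 a7 b7
    _mm256_storeu_ps(trgA, lo);
    _mm256_storeu_ps(trgB, hi);
}
```

The IMCI version in Figure 1.1b performs the same rearrangement on 16-element vectors but needs cross-lane permute4f128 operations, masked swizzles, and element-level shuffles; this asymmetry across ISAs is exactly the portability problem the rest of the dissertation targets.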

IMCI on MICs: The MIC coprocessor consists of up to 61 in-order cores, each of which is outfitted with a VPU. The VPU state for each thread contains 32 512-bit general registers, eight 16-bit mask registers, and a status register. The IMCI was introduced in accordance with this new VPU. Previous SIMD ISAs, e.g., SSE and AVX, are not supported by the vector architecture of the MIC, due to issues arising from the wider vectors, transcendental instructions, etc. [127]. On the MIC, each 512-bit vector is subdivided into four lanes and each lane contains four 32-bit elements. Both the lanes and the elements are ordered as DCBA. The bottom-left sub-figure in Figure 2.1 illustrates the data rearrangement within the same vector register with the shuffle and permute intrinsics. The permute intrinsic using _MM_PERM_DBCA performs a cross-lane rearrangement, which exchanges the data in lanes C and B. The shuffle intrinsic conducts the same operation but at the element level within each lane. Because the permute and shuffle intrinsics are executed by different components of the hardware, it is possible to overlap them in a pipelined mode [84]. The bottom-right sub-figure shows another example of picking data from two vector registers with the masked swizzle intrinsics. To that end, we use the mask m1 to select elements from either the swizzled vector of v1 or the vector v0, and then store the result to a blended vector v2. The behavior of the mask in the Intel MIC is non-destructive, in that no element in the source v0 is changed if the corresponding mask bit is 0.

2.1.2 Vector Processing in GPUs

In this dissertation, we use GPU programming models from different vendors. The first is HCC (Heterogeneous Compute Compiler) for AMD GPUs, specifically the GCN3 (Graphics Core Next) architecture. In AMD GPUs, the basic execution unit is called a wavefront (or wave, for brevity) and has 64 lanes; thus, each thread of a wave is assigned to a lane numbered from 0 to 63. A wave is assigned to a 16-wide SIMD unit, where each operation takes 4 cycles to finish. The second is CUDA (Compute Unified Device Architecture) for NVIDIA GPUs, specifically the Kepler, Maxwell, and Pascal architectures, where the basic scheduling unit is a 32-thread warp. To hide the high latency of off-chip DRAM memory access, modern GPUs possess a cached memory hierarchy. For AMD GCN3, each compute unit (CU) has a 16-kB vector L1 cache and a 64-kB LDS (Local Data Share) as scratchpad memory. In contrast, each multiprocessor (MP) in NVIDIA GPUs has a 48-kB or 96-kB scratchpad memory (i.e., shared memory). It contains an L1 cache as well; however, to enable it, developers may need to turn on -Xptxas -dlcm=ca at compile time, for example, for the Maxwell architecture. When considering registers as a cache, both AMD's CU and NVIDIA's MP possess 256-kB register files that support cross-lane data sharing. For AMD GPUs, this is realized with so-called "permute" instructions, e.g., the backward permute instruction ds_bpermute_b32. In contrast, NVIDIA GPUs support "shuffle" instructions, e.g., the general shuffle instruction shfl. Both ds_bpermute_b32 and shfl exhibit "pull" semantics, where each thread must read a register value from a target lane. Currently, the GPUs support built-in 32-bit data sharing, while for 64-bit data, one needs to split the data into two values, perform two rounds of data sharing, and then concatenate the results.

[Figure 2.2 content: the left panel shows the AMD GCN3 backward permute (_amdgcn_ds_bpermute(addr, src)), where byte addresses are first routed through a temporary buffer; the right panel shows the NVIDIA shuffle (_shfl(src, idx)), where each thread reads directly by lane index.]

Figure 2.2: Data permute/shuffle in AMD and NVIDIA GPUs. Values in white are stored in registers, and the temporary buffer (tmp) is in gray. Only 8 threads per wave are shown, for illustration.

The hardware implementations of data sharing and addressing differ between the two platforms. In AMD GCN3 GPUs, the LDS hardware is used to route data between the 64 lanes of a wave, but no actual LDS space is consumed [9]. The left sub-figure in Figure 2.2 shows that the target values of src are first put into a tmp buffer. Then, the indices are deduced by ignoring the two least significant bits in addr and are used later to select data from tmp. The reordered results are stored in dest. In NVIDIA GPUs, threads can directly "read" data from another thread that is actively participating in the shfl. The right sub-figure presents this mechanism, in which each thread directly accesses data based on the given index.
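The CUDA sketch below illustrates the "pull" semantics and the 64-bit split just described: every thread names the source lane it wants to read from, and a double is moved as two 32-bit shuffles. It is an example written for this chapter summary; recent CUDA toolkits also provide 64-bit __shfl_sync overloads that perform this split internally.

```cuda
#include <cuda_runtime.h>

// "Pull"-style data sharing: every thread reads a register value from srcLane.
// A 64-bit double is split into two 32-bit halves, shuffled in two rounds,
// and reassembled, mirroring the description above.
__device__ double shfl_double(double v, int srcLane) {
    int lo = __double2loint(v);
    int hi = __double2hiint(v);
    lo = __shfl_sync(0xffffffffu, lo, srcLane);
    hi = __shfl_sync(0xffffffffu, hi, srcLane);
    return __hiloint2double(hi, lo);
}

// Example use (assumes one 32-thread warp per block): reverse the 32 doubles
// held by a warp entirely in registers, with no shared-memory staging.
__global__ void warp_reverse(double* data) {
    int lane = threadIdx.x & 31;
    double v = data[blockIdx.x * 32 + lane];
    data[blockIdx.x * 32 + lane] = shfl_double(v, 31 - lane);
}
```

An equivalent HCC version would use the backward-permute intrinsic with byte addresses (roughly the lane index times 4), as the left panel of Figure 2.2 indicates.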

2.2 Parallel Sort and Segmented Sort

In the following, we discuss the related work on parallel sort on x86-based systems (with an emphasis on vectorization) and on segmented sort for GPUs, and we compare the previous work with our research.

Parallel sort on x86-based systems. Efficient sort kernels usually require a careful parallel design to match the underlying hardware, e.g., the vector units. Furtak et al. [65] use SIMD optimizations to handle the base cases of recursive sorting algorithms. AA-sort [82] is a two-phase sort with vectorized combsort and odd-even merge on CPUs. Chhugani et al. [48] devise another SIMD-friendly sort using the odd-even and bitonic sorting networks. Their solution provides an architecture-specific, hard-coded solution for the SSE and Larrabee ISAs, but it does not reveal many details on how the parameters are selected or how to deal with data across vector lanes (e.g., on Larrabee). Inoue et al. [83] propose a stable SIMD sorting solution to rearrange actual database records. On MICs, Bramas [29] proposes an efficient partitioning solution for sorting by using "store some" instructions in AVX-512. Xiaochen et al. [161] study bitonic merge sort using both mask and permute IMCI instructions. These existing studies have to explicitly use SIMD intrinsics to handle the tricky data-reordering operations required by different sorting and merging algorithms; our work (§ 3), in contrast, formalizes the patterns of sorting networks and vector ISAs to facilitate the automatic generation of efficient, "cross-platform" vector codes.

Segmented sort on GPUs. To sort many independent segments in parallel, previous methods mainly rely on a global sort with the necessary modifications. These solutions may fail to take advantage of both the data distribution and the architecture, because most of them [53, 63, 122] adopt a "one-size-fits-all" philosophy that treats different segments equally. The mechanisms in [53, 63] sort the whole array after extending the input with segment IDs as primary keys, which consumes extra memory space and results in increased computational complexity. Another mechanism [122] uses one thread block to handle a segment, no matter how different the segments are. This gives rise to deficiencies when processing a large batch of short segments, due to resource under-utilization. Many GPU applications [154, 102, 186, 185] reformulate the segmented sort problem in terms of global sort and call APIs supported by libraries, sacrificing the benefits of segmentation. In contrast, our method (§ 4) presents a differentiated method for different segment lengths. In particular, we formalize the N-to-M data-thread binding in the register-level sort to facilitate the search for the optimal performance of our kernels.

2.3 Parallel Sequence Alignment and Wavefront Computation

Here we provide a brief background on parallelizing the sequence alignment algorithms and their more general form, wavefront computation.

Parallelizing the sequence alignment. To parallelize the sequence alignment algorithms (e.g., the Smith-Waterman [140] and Needleman-Wunsch [116] algorithms, and sequence search/alignment tools [184, 183, 182]), approaches based on computation refactoring are usually adopted. One way is to partially ignore the dependencies in one direction and then compensate the intermediate results via multiple rounds of corrections [105, 62]. An alternative way is to use prefix sum operations to perform the compensation over the intermediate results [87]. However, the choice of the best parallel strategy depends on the characteristics of the input sequences, the hardware, etc. Moreover, the realization requires manually working with idiosyncratic vector instructions. Compared to the existing work, the distinctive aspect of our work (§ 5) is that we formalize the sequence alignment problems in terms of vector primitives. This additional abstraction layer enables the subsequent vector code generation for different x86 platforms. Furthermore, we provide a new mechanism to switch among these strategies at runtime, no matter which algorithms, configurations, and inputs are selected.

Parallelizing the wavefront computation. The sequence alignment algorithms actually belong to the more general class of wavefront computations. To parallelize such computations, one direction is to exploit the polyhedral model, which synthesizes affine transformations to adjust the iteration space and expose the potential parallelism hidden in target loop nests. With regard to GPUs, these efforts include [59, 20, 26]. Another direction concerns how to map the exposed parallelism efficiently onto parallel machines. Xiao et al. [160] propose an atomic-based local synchronization to handle dependencies among tiles. PeerWave [24] is a GPU solution for efficient local synchronization between tiles; for tiling, it utilizes square tiles and hypertiles, respectively. Manjikian and Abdelrahman [110] explore intratile and intertile locality for large-scale shared-memory multiprocessors. All the approaches above perform the computation strictly following the original dependency order, which might cause issues in memory access, locality, and load balance. Different from them, our work (§ 6) focuses on using a different computational order for more parallelism (from the problems) and more efficiency (from the underlying GPUs). This approach is built on a "weighted" prefix sum primitive, and we outline its validity and limitations on different combinations of operators using a mathematical proof.

2.4 Parallel Stencil Computation

In stencil computation, each cell is visited multiple times by its neighbors as the computation sweeps over a spatial grid. It is frequently selected in benchmarks for measuring performance and scalability [90]. Data reuse and memory access are key problems in stencil computation. Nguyen et al. [117] focus on a 3.5D blocking optimization (i.e., temporal reuse plus spatial 2.5D blocking). Rawat et al. [128] propose an effective tiling strategy to utilize both scratchpad memory and registers for 2D/3D stencils. Vizitiu et al. [149] place reusable data in constant cache, shared memory, etc., to explore the performance variance between different NVIDIA GPUs. Falch and Elster [61] optimize 1D/2D stencils using registers as buffers with manually written shuffle operations. The focus of our work (§ 7) is different, as we provide a framework for automatically and efficiently accessing neighboring data at different cache levels. It utilizes the knowledge of access patterns to automatically conduct data distribution, synchronization, and communication for different stencils. Note that, to take advantage of the shuffle/permute operations, our work is built on AMD's and NVIDIA's own programming languages (i.e., HCC and CUDA), rather than OpenCL or OpenACC.

2.5 Previous Approaches for Performance Portability

Now we introduce the previous approaches related to performance portability: library-based solutions (mainly based on algorithmic skeletons) and framework-based solutions (mainly based on DSLs).

Library-based solutions Many libraries are designed to take advantage of data-level parallelism from hardware. For x86 platforms, the vector functions (e.g., from ISPC [123] and Clang 6 [2]) ease the programming of vector units and hide the platform-specific codes. For GPUs, many widely-used algorithms and data structures are supported by libraries, e.g., ModernGPU [22] and Thrust [71]. However, because these libraries do not reveal the details of the actual instructions being used and do not consider the characteristics of the actual inputs, the performance of such codes is not well understood. Furthermore, programmers still need to figure out how to exploit these functions to meet their demands, e.g., finding appropriate functions and parameters.

Framework-based solutions Since the computation patterns in many applications can be formalized, automation frameworks have been proposed to generate codes for parallel machines. Mint [147], Physis [112], and Zhang and Mueller [188] present automation mechanisms to translate sequential stencil codes into efficient GPU codes. McFarlin et al. [113] propose a super-optimizer to conduct a guided search of the shortest sequence of vector instructions for desired vector operations. Ren et al. [131] provide an approach to optimize the SIMD code generation for generic permutations. Its code-searching part is based on the assumption that any data movement can be directly translated to SSE's shufps instruction. However, the modern vector units of AVX/AVX2/IMCI contain more lanes and more reordering instructions, and thus the search and selection is no longer straightforward. This dissertation, by contrast, considers the new features and differences of modern hardware (e.g., AVX2/IMCI vector operations for CPUs, data exchange with different memory hierarchies for GPUs) to achieve performance portability by studying research problems including data-reordering, data-thread binding, data reuse, etc.

2.6 Broader Performance Portability Issues

The techniques based on parallel patterns can be used to improve not only performance portability but also scalability of many applications in a wide variety of areas.

• Data mining: Many patterns can be found in social media modeling and analysis [35, 36, 38, 164], critical infrastructure analysis [37, 39, 88], and network segmentation and summarization [11, 34, 12, 10].

• Data Security: Anomaly detection and web security [178, 179, 180] also follow predictable patterns, which could be accelerated with parallel devices. The parallel patterns can also be applied in security domains to reduce performance overhead, e.g., application security [146, 145, 144] and web security [143, 162, 41].

Chapter 3

Data-reordering in Vectorized Sort

3.1 Introduction

Increasing processor frequency to improve performance is no longer a viable approach due to its exponential power consumption and heat generation. Therefore, modern processors integrate multiple cores onto a single die to increase inter-core parallelism. Furthermore, the vector processing unit (VPU) associated with each core can enable more fine-grained intra-core parallelism. Execution on a VPU follows the "single instruction, multiple data" (SIMD) paradigm by performing a "lock-step" operation over packed data. Though many regular codes can be auto-vectorized by modern compilers, some complex loop patterns prevent performant auto-vectorization, due to the lack of accurate compiler analysis and effective compiler transformations [108]. Thus, the burden falls on programmers to vectorize manually using intrinsics or even assembly code. Writing efficient vectorized (SIMD) code by hand is a time-consuming and error-prone activity. First, vectorizing existing (complex) codes requires expert knowledge in restructuring algorithms to exploit their SIMDization potential. Second, a comprehensive understanding of the actual vector intrinsics is needed. The intrinsics for data management and movement are as important as those for computation, because programmers often need to rearrange data in the vector units before sending them to the ALU. Unfortunately, the flexibility of the data-reordering intrinsics is restricted, as directly supporting an arbitrary permutation is impractical [84].

As a consequence, programmers must resort to a combination of data-reordering intrinsics to attain a desired computational pattern. Third, the vector instruction set architectures (ISAs) continue to evolve and expand, which, in turn, leads to potential portability issues. For example, to port codes from the Advanced Vector Extensions (AVX) on the CPU to the Initial Many Core Instructions (IMCI) on the Many Integrated Core (MIC) architecture, we either need to identify instructions with equivalent functionality or rewrite and tune the codes using alternative instructions. While library-based optimizations [81] can hide the details of vectorization from the end user, these challenges are still encountered during the design and implementation of the libraries themselves.

One alternative solution to relieve application programmers from writing low-level code is to let them focus only on domain-specific applications at a high level, help them abstract the computational and communication patterns with potential parallelization opportunities, and leverage modern compiler techniques to automatically generate vectorized codes that fit the parallel architectures of given accelerators. For example, McFarlin et al. [113] abstract the data-reordering patterns used in the Fast Fourier Transform (FFT) and generate the corresponding SIMD codes for CPUs. Mint and Physis [147, 112] capture stencil computation on GPUs, i.e., computational and communication patterns across a structured grid.

In this chapter, we focus on the sorting primitive and propose a framework, Automatic SIMDization of Parallel Sorting (a.k.a. ASPaS), to automatically generate efficient SIMD codes for parallel sorting on x86-based multicore and manycore processors, including CPUs and MIC, respectively. ASPaS takes any sorting network and a given ISA as inputs and automatically produces vectorized sorting code as the output. The code adopts a bottom-up approach to sort and merge segmented data. Since the vectorized sort function produces partially sorted data across different segments, ASPaS gathers the sorted data into contiguous regions through a transpose function before the merge stage. Considering the variety of sorting and merging networks1 [8] that correspond to different sorting algorithms (such as Odd-Even [21], Bitonic [21], Hibbard [70], and Bose-Nelson [27]) and the continuing evolution of instruction sets (such as SSE, AVX, AVX2, and IMCI), it is imperative to provide such a framework to hide the instruction-level details of sorting and allow programmers to focus on the use of the sorting algorithms instead.

1In this chapter, we distinguish between the sorting network and the merging network.

ASPaS consists of four major modules: (1) Sorter, (2) Transposer, (3) Merger, and (4) Code Generator. The SIMD Sorter takes a sorting network as input and generates a sequence of comparators for the sort function. The SIMD Transposer and SIMD Merger formalize the data-reordering operations in the transpose and merge functions as sequences of vector-matrix multiplications. The SIMD Code Generator creates an ISA-friendly pattern pool containing the requisite data-comparing and data-reordering primitives, builds those sequences with the primitives, and then translates them to the real ISA intrinsics according to the platforms.

We make the following contributions in this chapter. First, for portability, we propose the ASPaS framework to automatically generate cross-platform parallel sorting codes using architecture-specific SIMD instructions, including AVX, AVX2, and IMCI. Second, for functionality, using ASPaS, we can generate various parallel sorting codes for the combinations of five sorting networks, two merging networks, and three datatypes (integer, float, double) on Intel Ivy Bridge and Haswell CPUs and the Intel Knights Corner MIC. In addition, ASPaS generates the vectorization codes not only for sorting arrays but also for sorting {key,data} pairs, which is a requisite functionality for sorting real-world workloads. Third, for performance, we conduct a series of rigorous evaluations to demonstrate how the ASPaS-generated codes can yield performance benefits by efficiently using the vector units and computing cores on different hardware architectures. For the one-word type2, our SIMD codes on CPUs can deliver speedups of up to 4.1x and 6.5x (10.5x and 7.6x on MIC) over the serial sort and merge kernels, respectively. For the two-word type, the corresponding speedups are 1.2x and 2x on CPUs (6.0x and 3.2x on MIC), respectively. Compared with other single-threaded sort implementations, including qsort and sort from STL [124] and sort from Boost [136], our SIMD codes on CPUs deliver speedups ranging from 2.1x to 5.2x (2.5x to 5.1x on MIC) for the one-word type and 1.7x to 3.8x (1.3x to 3.1x on MIC) for the two-word type. Our ASPaS framework also improves the memory access pattern and thread-level parallelism. That is, we leverage the ASPaS-generated SIMD kernels as building blocks to create a multi-threaded sort (via multi-way merging). Compared with the parallel sort from Intel TBB [130], ASPaS delivers speedups of up to 2.5x and 1.7x on CPUs (6.7x and 5.0x on MIC) for the one-word type and the two-word type, respectively.

2We use the 32-bit Integer datatype as the representative of the one-word type and the 64-bit Double datatype for the two-word type.

3.2 Terminology

This section presents (1) a domain-specific language (DSL) to formalize the data-reordering patterns in our framework, and (2) the sorting and merging networks.

3.2.1 DSL for Data-Reordering Operations

To better describe the data-reordering operations, we adopt the representation of a domain-specific language (DSL) from [64, 192], with some modifications. In the DSL, the first-order operators define the basic data-reordering patterns, while the high-order operators connect such basic operations into complex ones. These operators are described below. First-order operators ($x$ is an input vector):

$S_2$: $(x_0, x_1) \rightarrow (\min(x_0, x_1), \max(x_0, x_1))$. The comparing operator resembles a comparator, which accepts two arbitrary values and outputs the sorted data. It can also explicitly accept two indices written in the parentheses following it.

$A_n$: $x_i \rightarrow x_j$, $0 \le i, j < n$, iff $A_{ij} = 1$. $A_n$ represents an arbitrary permutation operator, denoted as a permutation matrix which has exactly one "1" in each row and column.

$I_n$: $x_i \rightarrow x_i$, $0 \le i < n$. Essentially, $I_n$ is the identity operator, whose matrix is the diagonal matrix $I_n = \mathrm{diag}(1, 1, \cdots, 1)$.

$L^{km}_{m}$: $x_{ik+j} \rightarrow x_{jm+i}$, $0 \le i < m$, $0 \le j < k$. This is the stride permutation operator, which can be viewed as transposing an $m$-by-$k$ row-major matrix.

High-order operators ($A$, $B$ are two permutation operators):

(◦) The composition operator is used to describe a data flow. $A_n \circ B_n$ means that an $n$-element input vector is first processed by $A_n$ and then the result vector is processed by $B_n$. The product symbol $\prod$ represents iterative composition.

(⊕) The direct sum operator serves to merge two operators. $A_n \oplus B_m$ indicates that the first $n$ elements of the input vector are processed by $A_n$, while the remaining $m$ elements follow $B_m$.

(⊗) The tensor product used in this chapter appears as $I_m \otimes A_n$, which equals $A_n \oplus \cdots \oplus A_n$. This means the input vector is divided into $m$ segments, each of which is mapped to $A_n$.

With the DSL, a sequence of data-comparing and data-reordering patterns can be formalized and implemented by a sequence of vector-matrix multiplications. Note that we only use the DSL to describe the data-comparing and data-reordering patterns rather than creating a new DSL.
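As a concrete illustration of how a DSL operator corresponds to an index remapping (and hence to a permutation matrix), the following minimal C++ sketch applies $L^{km}_{m}$ to a $km$-element vector. It is our own illustration, not ASPaS code; the helper name stride_permute is hypothetical.

#include <cstdio>
#include <vector>

// Apply L^{km}_m as an index remapping, y[j*m + i] = x[i*k + j], which is
// equivalent to multiplying x by the corresponding km-by-km permutation matrix.
std::vector<int> stride_permute(const std::vector<int>& x, int k, int m) {
    std::vector<int> y(x.size());            // x.size() must equal k*m
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j)
            y[j * m + i] = x[i * k + j];
    return y;
}

int main() {
    std::vector<int> x = {0, 1, 2, 3, 4, 5, 6, 7};
    std::vector<int> y = stride_permute(x, /*k=*/4, /*m=*/2);  // L^8_2
    for (int v : y) std::printf("%d ", v);   // prints: 0 4 1 5 2 6 3 7
    std::printf("\n");
    return 0;
}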

3.2.2 Sorting and Merging Network

The sorting network is designed to sort the input data by using a sequence of comparisons, which are planned out in advance regardless of the values of the input data. The sorting network may depend on the merging network to merge pairs of sorted subarrays. Figure 3.1a exhibits the Knuth diagram [8] of two identical bitonic sorting networks. Each 4-key sorting network accepts 4 input elements. The paired dots represent the comparators that put the two inputs into ascending order. After being threaded through the wires of the network, these 4 elements are sorted. Figure 3.1b is a merging network to merge two sorted 4-key vectors into an entirely sorted 8-key vector. Although sorting and merging networks are usually adopted in circuit designs, they are also suitable for SIMD implementation thanks to the absence of unpredictable branches. In this chapter, the sorting and merging networks are represented by a list of comparators, each of which is denoted as CMP(x, y) and indicates a comparison operation between the x-th and y-th elements of the input data.
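The following scalar C++ sketch (our illustration, not ASPaS output) executes a sorting network given as a fixed comparator list. The 5-comparator list used here is the 4-key odd-even network; any comparator list, including those named in this chapter, fits the same loop.

#include <cstdio>
#include <utility>
#include <vector>

// Execute a comparator list: CMP(x, y) puts a[x] <= a[y].
void run_network(int* a, const std::vector<std::pair<int,int>>& cmps) {
    for (const auto& c : cmps)
        if (a[c.first] > a[c.second])
            std::swap(a[c.first], a[c.second]);
}

int main() {
    std::vector<std::pair<int,int>> net = {{0,1},{2,3},{0,2},{1,3},{1,2}};
    int a[4] = {42, 7, 19, 3};
    run_network(a, net);
    for (int v : a) std::printf("%d ", v);   // prints: 3 7 19 42
    std::printf("\n");
    return 0;
}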


Figure 3.1: Bitonic networks (a) Two 4-key sorting networks (b) One 8-key merging network.

3.3 Framework and Generalized Patterns

The ASPaS parallel sorting uses an iterative bottom-up scheme to sort and merge segmented data. Algorithm 1 illustrates the scheme: First, the input data are divided into contiguous segments, each of whose size equals the built-in SIMD width to the power of 2 (i.e., $w^2$). Second, these segments are loaded into vector registers for sorting with the functions aspas_sort and aspas_transpose (the sort stage in the loop at line 3). Third, the algorithm merges neighboring sorted segments to generate the output by iteratively calling the function aspas_merge (the merge stage in the loop at line 9). The functions load, store, aspas_sort, aspas_transpose, and aspas_merge are generated by ASPaS using the platform-specific intrinsics. Since load and store can be directly translated to intrinsics once the ISA is given, we focus on the other three kernel functions with the prefix aspas in the remaining sections. Figure 3.2 depicts the structure of the ASPaS framework to generate the sort function. Three modules (SIMD Sorter, SIMD Transposer, and SIMD Merger) are responsible for building the sequences of comparing and data-reordering operations for the aforementioned kernel functions. Then, these sequences are mapped to the real SIMD intrinsics through the SIMD Code Generator module, and the codes are further optimized from the perspectives of memory access pattern and thread-level parallelism (in § 3.4).

Algorithm 1: ASPaS Parallel Sorting Structure
// w is the SIMD width
 1 Function aspas::sort(Array a)
 2   Vector v1, ..., vw;
 3   foreach Segment seg in a do
 4     // load seg to v1, ..., vw
 5     aspas_sort(v1, ..., vw);
 6     aspas_transpose(v1, ..., vw);
 7     // store v1, ..., vw to seg
 8   Array b ← new Array[a.size];
 9   for s ← w; s ...
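For reference, the following simplified scalar sketch mirrors the bottom-up structure of Algorithm 1. It is our own illustration: std::sort and std::inplace_merge stand in for the ASPaS-generated aspas_sort/aspas_transpose and aspas_merge kernels, and w2 plays the role of the $w^2$ segment size.

#include <algorithm>
#include <cstddef>
#include <vector>

void bottom_up_sort(std::vector<int>& a, std::size_t w2) {
    // Sort stage: sort each w*w-sized contiguous segment.
    for (std::size_t i = 0; i < a.size(); i += w2)
        std::sort(a.begin() + i, a.begin() + std::min(i + w2, a.size()));
    // Merge stage: iteratively merge pairs of neighboring sorted segments.
    for (std::size_t s = w2; s < a.size(); s *= 2)
        for (std::size_t i = 0; i + s < a.size(); i += 2 * s)
            std::inplace_merge(a.begin() + i, a.begin() + i + s,
                               a.begin() + std::min(i + 2 * s, a.size()));
}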

Figure 3.2: The structure of ASPaS and the generated sort.

3.3.1 SIMD Sorter

The operations of aspas_sort are taken care of by the SIMD Sorter. As shown in Figure 3.3, aspas_sort loads n-by-n elements into n n-wide vectors and threads them through the given sorting network, leading to data sorted at the aligned positions across vectors. Figure 3.4 presents an example of a 4-by-4 data matrix stored in vectors and a 4-key sorting network (including its original input macros). Here, each dot represents one vector and each vertical line indicates a vector comparison. The six comparisons rearrange the original data into ascending order in each column. Figure 3.4 also shows the data dependency between these comparators. For example, CMP(0,1) and CMP(2,3) can be issued simultaneously, while CMP(0,3) can occur only after these two. It is natural to form three groups of comparators for this sorting network. We also have an optimized grouping mechanism to minimize the number of groups for other, more complicated sorting networks. For more details, please refer to [75].

Figure 3.3: Mechanism of the sort stage: operations generated by SIMD Sorter and SIMD Transposer.

Figure 3.4: Four 4-element vectors go through the 4-key sorting network. Afterwards data is sorted in each column of the matrix.

Since we have the groups of comparators, we can generate the vector codes for aspas_sort by keeping two sets of vector variables, a and b. All the initial data are stored in the vectors of set a. Then, we jump to the first group of the sorting network. For each comparator in the current group, we generate the vector operations to compare the relevant vector variables and store the results to the vectors in set b. The unused vectors are directly copied to set b. For the next group, we flip the identities of a and b. Therefore, set b becomes the input, and the results will be stored back to a. This process continues until all groups of the sorting network have been handled. All the vector operations in aspas_sort will be mapped to ISA-specific intrinsics (e.g., _mm256_max and _mm256_min on CPUs) later by the SIMD Code Generator. At this point, the data is partially sorted but stored in column-major order.
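To make one such vector comparison concrete, the following AVX2 sketch shows a single comparator CMP applied to whole vectors, so the same comparison is performed on every column of the data matrix at once. This is our illustration rather than ASPaS-generated code; the _epi32 variants are one concrete choice for 32-bit integers.

#include <immintrin.h>

static inline void vector_cmp(__m256i& vx, __m256i& vy) {
    __m256i lo = _mm256_min_epi32(vx, vy);   // per-column minima
    __m256i hi = _mm256_max_epi32(vx, vy);   // per-column maxima
    vx = lo;                                  // CMP leaves the smaller value in vx
    vy = hi;
}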

3.3.2 SIMD Transposer

As illustrated in Figure 3.3, the aspas_sort function has scattered the sorted data across different vectors. The next task is to gather them into the same vectors (i.e., rows) for further operations. There are two alternative ways to achieve the gathering: one directly uses the gather/scatter SIMD intrinsics, and the other uses an in-register matrix transpose over the loaded vectors. The first scheme provides a convenient means to handle non-contiguous data in memory, but with the penalty of the high latency of accessing scattered locations. The second one avoids the latency penalty at the expense of using complicated data-reordering operations. Considering the high latency of the gather/scatter intrinsics and the incompatibility with architectures that do not support them, we adopt the second scheme in the SIMD Transposer. To decouple the binding between the operations of the matrix transpose and the dedicated intrinsics with various SIMD widths, we formalize the data-reordering operations using a sequence of permutation operators. Subsequently, the sequence is handed over to the SIMD Code Generator to generate the platform-specific SIMD codes for the aspas_transpose function.

$$\prod_{j=1}^{t-1}\Big(L^{2^t}_{2} \circ (I_{2^{t-j-1}} \otimes L^{2^{j+1}}_{2^{j}}) \circ (I_{2^{t-j}} \otimes L^{2^{j}}_{2}) \circ L^{2^t}_{2^{t-1}}\ [v_{id}, v_{id+2^{j-1}}]\Big) \qquad (3.1)$$

Eq. 3.1 gives the operations performed on the preloaded vectors for the matrix transpose, where $w$ is the SIMD width, $t = \log(2w)$, and for each $j$, $id \in \{i \cdot 2^{j} + n \mid 0 \le i < w/2^{j}, 0 \le n < 2^{j-1}\}$, which will form $(w/2^{j}) \cdot 2^{j-1} = w/2$ pairs of operand vectors. The sequence of permutation operators preceding each operand pair is applied on that pair. The square brackets wrap these pairs of vectors. Figure 3.5 illustrates an example of the in-register transpose with $w = 4$. The elements are preloaded into vectors $v_0$, $v_1$, $v_2$, and $v_3$ and have already been sorted vertically. $t - 1 = 2$ indicates that there are 2 steps, denoted as 1 and 2 in the figure. For step $j = 1$, the permutation operators are applied on the pairs $[v_0, v_1]$ and $[v_2, v_3]$; for $j = 2$, the operations are on the pairs $[v_0, v_2]$ and $[v_1, v_3]$. After the vectors go through the two steps accordingly, the matrix is transposed, and the elements are gathered in the same vectors.
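The following SSE sketch illustrates the in-register transpose idea for four 4-element vectors of single-precision values. It is not the sequence ASPaS derives from Eq. 3.1; it simply shows that a 4x4 transpose can be done entirely with register-to-register data-reordering intrinsics (the _MM_TRANSPOSE4_PS macro expands to unpack/shuffle instructions), avoiding gather/scatter.

#include <xmmintrin.h>

void transpose4x4(float m[4][4]) {
    __m128 r0 = _mm_loadu_ps(m[0]);
    __m128 r1 = _mm_loadu_ps(m[1]);
    __m128 r2 = _mm_loadu_ps(m[2]);
    __m128 r3 = _mm_loadu_ps(m[3]);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // columns become rows
    _mm_storeu_ps(m[0], r0);
    _mm_storeu_ps(m[1], r1);
    _mm_storeu_ps(m[2], r2);
    _mm_storeu_ps(m[3], r3);
}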


Figure 3.5: Four 4-element vectors are transposed with the formalized permutation operators of the DSL.

3.3.3 SIMD Merger

For now, the data have been sorted within each segment thanks to aspas_sort and aspas_transpose. Then, we use aspas_merge to iteratively combine pairs of sorted data into larger sequences. The SIMD Merger is responsible for the comparison and data-reordering operations according to the given merging networks, e.g., the odd-even and bitonic networks. In ASPaS, we select the bitonic merging network for three reasons: (1) the bitonic merging network can be easily scaled to any $2^n$-sized keys; (2) there is no idle element in the input vectors for each comparison step; and (3) its symmetric operations can facilitate the vector instruction selection (discussed in § 3.4.1). As a result, it is much easier to vectorize the bitonic merging network than others. In terms of implementation, we have provided two variants of the bitonic merging network [192] that achieve the same functionality. Their data-reordering operations can be formalized as shown below.

$$\prod_{j=1}^{t}\Big((I_{2^{j-1}} \otimes L^{2^{t-j+1}}_{2}) \circ (I_{2^{t-1}} \otimes S_2) \circ (I_{2^{j-1}} \otimes L^{2^{t-j+1}}_{2^{t-j}})\Big)\,[v, u] \qquad (3.2)$$

$$\prod_{j=1}^{t}\Big(L^{2^t}_{2} \circ (I_{2^{t-1}} \otimes S_2)\Big)\,[v, u] \qquad (3.3)$$

Similar to § 3.3.2, $t = \log(2w)$ and $w$ is the SIMD width. The operand vectors $v$ and $u$ represent two sorted sequences (the elements of vector $u$ are stored in reverse order in advance). In Eq. 3.2, the data-reordering operations are controlled by the variable $j$ and vary in each step, while in Eq. 3.3, the permutation operators are independent of $j$, thereby leading to uniform permutation patterns in each step. Hence, we label the pattern in Eq. 3.2 as the inconsistent pattern and that in Eq. 3.3 as the consistent pattern. These patterns will be transmitted to and processed by the SIMD Code Generator to generate the aspas_merge function. We will present the performance comparison of these two patterns in § 3.5.
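For intuition, the following scalar sketch performs the 2w-key bitonic merging that Eq. 3.2 and Eq. 3.3 vectorize. It is our illustration, not the generated code: the input must be bitonic (e.g., the first half ascending and the second half descending, i.e., vector u stored in reverse), each outer iteration is one comparison step of the network, and ASPaS maps the compare-exchanges to S_2 vector min/max operations and the index wiring to permutation intrinsics.

#include <cstddef>
#include <utility>

// Merge a bitonic sequence a[0..n) in place; n = 2w must be a power of two.
void bitonic_merge(int* a, std::size_t n) {
    for (std::size_t step = n / 2; step >= 1; step /= 2)
        for (std::size_t i = 0; i < n; ++i)
            if ((i & step) == 0 && a[i] > a[i + step])
                std::swap(a[i], a[i + step]);
}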


Figure 3.6: Two formalized variants of bitonic merging networks: the inconsistent pattern and the consistent pattern. All elements in vectors v and u are sorted, but stored in reverse order in vector u.

Figure 3.6 presents an example of the two variants of bitonic merging networks under the condition of $w = 4$. The data-reordering operations of the inconsistent pattern keep changing in each step, while those of the consistent one stay identical. Though the data-reordering operations of the two variants are quite different, both achieve the same merging functionality within the same number of steps, which is determined by the SIMD width $w$.

3.4 Code Generation and Optimization

In this section, we first show the search mechanism of the ASPaS framework to find the most efficient SIMD instructions. Then, the generated codes are optimized to take advantage of the memory hierarchy and the multi- and many-core resources of x86-based systems.

3.4.1 SIMD Code Generator

This module in ASPaS accepts the comparison operations from the SIMD Sorter and the data-reordering operations from the SIMD Transposer and SIMD Merger in order to generate the real ISA-specific vector codes. We put the emphasis on finding the most efficient intrinsics for the data-reordering operations, since mapping comparison operations to SIMD intrinsics is straightforward. In the module, we first define a SIMD-friendly primitive pool based on the characteristics of the data-reordering operations, then dynamically build the primitive sequences according to the matching score between what we have achieved so far and the target pattern, and finally translate the selected primitives into the real intrinsics for different platforms.

Primitive Pool Building

Some previous research, e.g., the automatic Fast Fourier Transform (FFT) vectorization [113], uses an exhaustive and heuristic search over all possible intrinsic combinations, which is time-consuming, especially for richer instruction sets such as IMCI. To circumvent this limitation, we first build a primitive pool to prune the search space, and the primitives are supposed to be SIMD-friendly. The most notable feature of the data-reordering operations for the transpose and merge is their symmetry: all the operations applied on the first half of the input are equivalent to those on the second half in a mirror style. We assume that all the components of the sequences to achieve these operations are also symmetric. We categorize these components as (1) the primitives for the symmetric permute operations on the same vector and (2) the primitives for the blend operations across two vectors.

Permute Primitives: Considering 4 elements per lane (e.g., Integer or Float) or 4 lanes per register (e.g., an IMCI register), there are $4^4 = 256$ possibilities for either intra-lane or inter-lane permute operations. However, only those permutations without duplicated values are useful in our case, reducing the possibilities to $4! = 24$. Among them, merely 8 symmetric data-reordering patterns will be selected, i.e., DCBA (original order), DBCA, CDAB, BDAC, BADC, CADB, ACBD, and ABCD, in which each letter denotes an element or a lane. If we are working on 2 elements per lane (e.g., Double) or 2 lanes per register (e.g., an AVX register), there are two symmetric patterns without duplicated values, i.e., BA (original order) and AB.

Blend Primitives: While blending two vectors into one, the elements are supposed to be equally and symmetrically distributed from the two input vectors. Hence, we can boil down the numerous mask modifiers to only a few. We define pairs of blend patterns $(0_{2^i}, {2^i}_{2^i})$, where $0 \le i < \log(w)$; for $w = 16$ these are (0_1, 1_1), (0_2, 2_2), (0_4, 4_4), and (0_8, 8_8). Among them, the pair (0_2, 2_2) corresponds to $i = 1$, representing the bit streams $(1100)_4$ and $(0011)_4$ (the subscript 4 denotes the number of repetitions). Now, we further categorize the primitives into 4 types based on permute or blend and intra-lane or inter-lane. Table 3.1 illustrates the categories and associated exemplar operations, where the vector width $w$ is set to 8 (2 lanes) for clarity.

Table 3.1: Primitive Types

Type #  Type                Example (vector width = 8)
0       intra-lane-permute  ABCDEFGH → BADCFEHG (cdab)
1       inter-lane-permute  ABCDEFGH → EFGHABCD (--ab)
2       intra-lane-blend    ABCDEFGH | IJKLMNOP → ABKLEFOP (2_2)
3       inter-lane-blend    ABCDEFGH | IJKLMNOP → IJKLEFGH (0_4)

The primitives are materialized into permutation matrices in ASPaS. Since the blend primitives always operate on two vectors (concatenated as one 2w-element vector), the dimensions of the blend permutation matrices are expanded to 2w by 2w as well. Accordingly, for the permute primitives, we pair an empty vector with the single input vector and specify whether the primitive works on the first vector v or the second vector u. Therefore, for example, if w = 16, there are 32 = 8 (permute primitives) × 2 (intra-lane or inter-lane) × 2 (operating on v or u) permute permutation matrices and 8 (4 pairs of blend primitives) blend permutation matrices. Figure 3.7 illustrates examples of the permutation matrices. The matrices "shuffle_cdab_v" and "shuffle_cdab_u" correspond to the same permute primitive applied to the two halves of the concatenated vectors. The matrices "blend_0_1_v" and "blend_1_1_u" correspond to one pair of blend primitives (0_1, 1_1). So far, 4 sub-pools of permutation matrices are created according to the 4 primitive types.


Figure 3.7: Permute matrix representations and the pairing rules.

Sequence Building

Two rules are used in the module to facilitate the searching process. They are based on two observations from the formalized data-reordering operations illustrated in Eq. 3.1, Eq. 3.2, and Eq. 3.3. Obs.1 The same data-reordering operations are always conducted on two input vectors. Obs.2 The permute operations always accompany the blend operations to keep the symmetric pattern. Figure 3.8 exhibits the symmetric patterns, which are essentially the first step in Figure 3.5. The default blend is limited to picking elements from aligned positions of two input vectors, while the symmetric blend can achieve an interleaving mode by coupling permute primitives with blend primitives, as shown in the figure. Hence, the usage of the two rules in the sequence building algorithm is described below.
Rule 1: when a primitive is selected for one vector v, pair the corresponding primitive for the other vector u. For a permute primitive, the corresponding permute has exactly the same pattern; for a blend primitive, the corresponding blend has the complementary blend pattern (i.e., the bit stream that has already been paired).
Rule 2: when a blend primitive is selected, pair it with the corresponding permute primitive: pair the intra-lane-permute that swaps adjacent elements (CDAB) with the (0_1, 1_1) blend, the intra-lane-permute that swaps adjacent pairs of elements (BADC) with (0_2, 2_2), the inter-lane-permute that swaps adjacent lanes (CDAB) with (0_4, 4_4), and the inter-lane-permute that swaps adjacent pairs of lanes (BADC) with (0_8, 8_8).


Figure 3.8: Symmetric blend operation and its pairing details.

The sequence building algorithm aims to generate sequences of primitives that achieve the given data-reordering patterns of Eq. 3.1, Eq. 3.2, and Eq. 3.3. Two w-sized input vectors v and u are used and concatenated into vec_inp. Its initial elements are set to the default indices (from 1 to 2w). The target vec_trg is derived by applying the given data-reordering operators on vec_inp. Then, the building algorithm selects permutation matrices from the primitive pool, performs the vector-matrix multiplications over vec_inp, and checks whether the intermediate result vec_im approximates vec_trg by using our two matching scores:
l-score: the lane-level matching score, which accumulates by one when the corresponding lanes have exactly the same elements (regardless of order).
e-score: the element-level matching score, which increases by one when an element matches its counterpart in vec_trg.

Suppose we have a vector width w and e elements per lane. The maximum l-score equals 2w/e, attained when all the aligned lanes from the two vectors match, while the maximum e-score is 2w, attained when all the aligned elements match. With the matching scores, the process of sequence building is transformed into finding the maximum scores. For example, if we have the input "AB|CD|EF|GH" and the output "HG|DC|FE|BA" (assuming four lanes and two elements per lane), we first search for primitives for the inter-lane reordering, e.g., from "AB|CD|EF|GH" to "GH|CD|EF|AB", and then search for primitives for the intra-lane reordering, e.g., from "GH|CD|EF|AB" to "HG|DC|FE|BA". By checking the primitives hierarchically, we add those primitives that increase the l-score or e-score and thus approximate the desired output pattern.
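The two scores can be made concrete with the following C++ sketch over concatenated index vectors of length 2w with e elements per lane. The helper names are hypothetical and this is not the ASPaS internal code.

#include <algorithm>
#include <cstddef>
#include <vector>

// l_score: count lanes whose element sets match regardless of order.
int l_score(std::vector<int> im, std::vector<int> trg, std::size_t e) {
    int score = 0;
    for (std::size_t lane = 0; lane < im.size(); lane += e) {
        std::sort(im.begin() + lane, im.begin() + lane + e);
        std::sort(trg.begin() + lane, trg.begin() + lane + e);
        if (std::equal(im.begin() + lane, im.begin() + lane + e,
                       trg.begin() + lane))
            ++score;
    }
    return score;   // maximum is 2w / e
}

// e_score: count positions whose elements match exactly.
int e_score(const std::vector<int>& im, const std::vector<int>& trg) {
    int score = 0;
    for (std::size_t i = 0; i < im.size(); ++i)
        if (im[i] == trg[i]) ++score;
    return score;   // maximum is 2w
}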

Algorithm 2: Sequence Building Algorithm

Input: vec_inp, vec_trg
Output: seqs_ret
 1 Sequences seqs_cand ← new Sequences(∅);  // put a null sequence
 2 Int l_score_init ← LaneCmp(vec_inp, vec_trg);
 3 if l_score_init = 2w/e then
 4   seqs_ret ← InstSelector(seqs_cand, Type[0]);
 5 else
 6   i ← 1;
 7   while not Threshold() do
 8     ty ← Type[i];
 9     foreach Sequence seq in seqs_cand do
10       vec_im ← Apply(vec_inp, seq);
11       l_score_old ← LaneCmp(vec_im, vec_trg);
12       foreach Primitive prim in ty do
13         if i = 1 then  // prim is an inter-lane-permute primitive
14           prim_prd ← Pair(prim, RULE1);
15           vec_upd ← Apply(vec_im, prim + prim_prd);
16           seq_ext ← prim + prim_prd;
17         else           // prim is a blend primitive
18           prim_prd ← Pair(prim, RULE1);
19           perm0 ← Pair(prim, RULE2);
20           perm1 ← Pair(prim_prd, RULE2);
21           vec_upd0 ← Apply(vec_im, perm0 + prim);
22           vec_upd1 ← Apply(vec_im, perm1 + prim_prd);
23           vec_upd ← Combine(vec_upd0, vec_upd1);
24           seq_ext ← perm0 + prim + perm1 + prim_prd;
25         l_score_new ← LaneCmp(vec_upd, vec_trg);
26         if l_score_new > l_score_old then
27           seqs_buf.add(seq + seq_ext);
28     seqs_cand.append(seqs_buf);
29     seqs_buf.clear();
30     i ← ((++i)-1) % 3 + 1;
31   seqs_sel ← PickLaneMatchedSeqs(seqs_cand);
32   seqs_ret ← InstSelector(seqs_sel, Type[0]);
33 Function InstSelector(Sequences seqs_cand, Type ty)
34   foreach Sequence seq in seqs_cand do
35     vec_im ← Apply(vec_inp, seq);
36     foreach Primitive prim in ty do
37       prim_prd ← Pair(prim, RULE1);
38       vec_upd ← Apply(vec_im, prim + prim_prd);
39       e_score ← ElemCmp(vec_upd, vec_trg);
40       if e_score = 2w then
41         seqs_ret.add(seq + prim + prim_prd);
42   return seqs_ret;

Algorithm 2 shows the pseudocode of the sequence building algorithm. The input contains the aforementioned vec_inp and vec_trg. The output seqs_ret is a container that holds the built sequences of primitives, which will later be translated to the real ISA intrinsics. The container seqs_cand stores candidate sequences and is initialized to contain an empty sequence. First, the algorithm checks the initial vec_inp against vec_trg and gets the l-score. If it equals 2w/e, meaning the aligned lanes already match, we only need to select "intra-lane-permute" primitives (line 4) to improve the e-score. Otherwise, we work on the sub-pools of type 1, 2, or 3 in a round-robin manner. In the while loop, for each sequence in seqs_cand, we first calculate l_score_old, and then we calculate l_score_new by tentatively adding primitives one by one from the current sub-pool. If the primitive prim comes from the "inter-lane-permute" type, we produce the paired permute primitive prim_prd based on Rule 1 (line 14). If prim is from the blend types, we produce the paired blend primitive prim_prd based on Rule 1 and then find their paired permute primitives perm0 and perm1 based on Rule 2 (lines 18-20). The two rules help to form the symmetric operations. After the selected primitives have been applied, which corresponds to several vector-matrix multiplications, we obtain a vec_upd, leading to a new l-score l_score_new compared against vec_trg (line 25). If the l-score has increased, we add the sequence of the selected primitives to seqs_cand for further improvement. The threshold (line 7) is a configuration parameter that controls the upper bound on how many iterations the algorithm can tolerate; e.g., we set it to 9 in the evaluation in order to find as many sequences as possible. Finally, we use PickLaneMatchedSeqs to select those sequences that make the l-score equal to 2w/e, and go to the "intra-lane-permute" selection (line 32), which ensures we obtain complete sequences of primitives.

Primitives Translation

Now, we can map the sequences in seqs_ret to the real ISA intrinsics. Although the vector ISAs of the CPU and MIC platforms are distinct from one another, we can still find the desired intrinsics thanks to the SIMD-friendly primitives. If there are multiple selections that achieve the same primitive, we always prefer the selection with the fewest intrinsics.

On MIC: if multiple shortest solutions exist, we use the interleaved style of inter-lane and intra-lane primitives, which can be executed in a pipelined mode on MIC as discussed in § 2.1.1. For the primitives from "intra-lane-permute" and "inter-lane-permute", we directly map them to the vector intrinsics _mm512_shuffle and _mm512_permute4f128 with appropriate permute parameters. The primitives from "intra-lane-blend" and "inter-lane-blend" are mapped to the masked permute intrinsics _mm512_mask_shuffle and _mm512_mask_permute4f128. The masks are derived from their blend patterns. Furthermore, when a primitive is from "intra-lane" and its parameter is supported by the swizzle intrinsics, we use the light-weight swizzle intrinsics to optimize the performance.

On CPU: for the primitives from "intra-lane-permute" and "inter-lane-permute", we map them to the vector intrinsics of AVX's _mm256_permute (or AVX2's _mm256_shuffle3) and _mm256_permute2f128 with appropriate permute parameters. For the blend primitives, we need to find specific combinations of intrinsics, since AVX and AVX2 have no mask mechanism similar to IMCI's. For "intra-lane-blend" primitives, if the blend pattern picks interleaved elements from the two vectors, e.g., 0101 or 1010, we use _mm256_unpacklo and _mm256_unpackhi to unpack and interleave the neighboring elements. In contrast, for the patterns that select two neighboring elements, e.g., 0011 or 1100, we use AVX's _mm256_shuffle, which can take two vectors as input and pick every two elements from each input. For the "inter-lane-blend" primitives, we use _mm256_permute2f128. Note that, since many intrinsics in AVX only support operations on floating-point elements, we have to cast the datatypes if we are working on integers; on AVX2, we can directly use intrinsics that handle integers without casting. As a result, for parallel sorting on integers, the generated codes on AVX2 may use far fewer intrinsics than those on AVX.
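To make the translation step concrete, the following AVX2 sketch realizes one intra-lane-blend and one inter-lane-blend primitive. It is our own illustration rather than ASPaS output; the integer variants _mm256_unpacklo_epi32 and _mm256_permute2x128_si256 are one concrete choice alongside the float-oriented intrinsics named above.

#include <immintrin.h>

void blend_examples(__m256i va, __m256i vb, __m256i* intra, __m256i* inter) {
    // intra-lane-blend: interleave neighboring 32-bit elements from the low
    // halves of each 128-bit lane of the two vectors (a 0101-style pattern).
    *intra = _mm256_unpacklo_epi32(va, vb);
    // inter-lane-blend: take the low 128-bit lane of va and the high 128-bit
    // lane of vb (a 0_4-style pattern).
    *inter = _mm256_permute2x128_si256(va, vb, 0x30);
}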

3.4.2 Organization of the ASPaS Kernels

So far, we have generated three building kernels in ASPaS: aspas_sort(), aspas_transpose(), and aspas_merge(). As shown in Figure 3.9, we carefully organize these kernels to form aspas::sort as illustrated in Algorithm 1. Note that the figure shows the sort, transpose, and merge stages on each thread; the multithreaded implementation will be discussed in the next subsection. First, aspas_sort() and aspas_transpose() are performed on every segment of the input to create a partially sorted array. Second, we enter the merge stage. Rather than directly merging the sorted segments level by level as in our previous research [75], we adopt the multiway merge [83, 48]: we merge the sorted segments for multiple levels within each block (a cache-sized trunk) to fully utilize the data in the cache before moving to the next block. This strategy is cache-friendly, since it avoids frequently swapping data in and out of the cache. We take this multiway merge strategy when the merged segments are small enough to fit into the LLC, which is usually the case in the first several levels. For the larger segments in the later levels, we fall back to the two-way merge. Similar to existing libraries, e.g., STL, Boost, and TBB, we also provide the merge functionality for programmers as a separate interface. The interface aspas::merge is organized similarly to aspas::sort shown in the figure but only uses aspas_merge().

3AVX's _mm256_shuffle is different from AVX2's and reorders data from two input vectors.

Figure 3.9: The organization of ASPaS kernels for the single-threaded aspas::sort.

3.4.3 Thread-level Parallelism

In order to maximize the utilization of the multiple cores of modern x86-based systems, we integrate aspas::sort and aspas::merge with thread-level parallelism using Pthreads. Initially, we split the input data into separate parts, each of which is assigned to one thread. All the threads sort their own parts using aspas::sort independently. Then, we merge each thread's sorted part together. The simplest way might be to assign half of the threads to merge two neighboring sorted parts into one by iteratively calling aspas::merge until there is only one thread left. However, this method significantly under-utilizes the computing resources. For example, in the last level of merging, there is only one thread merging two trunks while all the other threads are idle. Therefore, for the last several levels of merging, we adopt MergePath [68] to let multiple threads merge two segments. Assume that for each pair of sorted segments with lengths m and n, we have k threads working on them. First, each thread calculates the (i/k)-th value in the imagined merged array without actually merging the inputs, where i is the thread index. This step can be done in O(log(m + n)). Second, we split the workloads into k exclusive and balanced portions according to the k splitting values. Finally, each thread can merge its assigned portion independently. Note that this strategy is capable of minimizing the data access overhead on remote memory banks of a NUMA architecture, since the array is equally split and stored in each memory bank, and a thread will first merge data in the local memory region and then access remote data on demand in a serial mode [83]. In the evaluation, our multithreaded version adopts this optimized design.
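The following sketch shows the MergePath/co-ranking idea in scalar C++. It is our own illustration, not the ASPaS implementation: given two sorted arrays a and b and a rank r in the merged output, it finds how many elements of a (and hence of b) precede position r; thread i of k uses r = i*(m+n)/k, and the resulting split points delimit balanced, independent portions.

#include <algorithm>
#include <cstddef>

std::size_t corank(const int* a, std::size_t m,
                   const int* b, std::size_t n, std::size_t r) {
    std::size_t lo = (r > n) ? r - n : 0;      // fewest elements taken from a
    std::size_t hi = std::min(r, m);           // most elements taken from a
    while (lo < hi) {
        std::size_t j = (lo + hi) / 2;         // candidate: j from a, r-j from b
        if (a[j] < b[r - j - 1])               // a[j] too small: take more from a
            lo = j + 1;
        else
            hi = j;
    }
    return lo;                                  // elements of a before rank r
}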

3.4.4 Sorting of {key,data} Pairs

In many real-world applications, sorting is widely used to reorder specific data structures based on their keys. To that end, we extend ASPaS with this functionality: generating the vectorization codes to sort {key, data} pairs, where the key represents the target for sorting and the data is the address of the data structure containing that key. The research work in [48] proposes two strategies to sort {key, data} pairs. The first strategy is to pack the key and its associated data into a single entry. Then, sorting the entries is equivalent to sorting the keys, since the keys are placed in the high bits. However, if the sum of the lengths of the key and data exceeds the maximum length of the built-in data types, it is non-trivial to carry this strategy out. The second strategy is to put the keys and data into two separate arrays. While sorting the keys, the comparison results are stored as masks that are used to control the data-reordering of the associated data. In this chapter, we use the second method. Different from [48], which focuses on 32-bit keys and data, ASPaS is able to handle different combinations of 32/64-bit keys and 32/64-bit data and their correspondingly varied data-reordering patterns. For implementation, ASPaS uses compare intrinsics rather than max/min intrinsics to get the appropriate masks. The masks may need to be stretched or split depending on the differences between the lengths of the keys and data. With the masks, we use blend intrinsics on both the key vectors and the data vectors to reorder elements. Table 3.2 shows how the building modules are used to find the desired intrinsics for the key and data vectors, respectively.
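As a concrete illustration of this mask-and-blend idea, the following AVX2 sketch performs one min/max comparator on the keys and reorders the data with the same mask. This is not the ASPaS-generated code; ka/kb are key vectors, da/db the associated data vectors, and the _epi32/_epi8 suffixes are one possible choice for 32-bit keys and data.

#include <immintrin.h>

// One comparator on {key,data} pairs: the compare mask computed from the keys
// drives identical blends on the key vectors and the data vectors.
static inline void pair_cmp(__m256i& ka, __m256i& kb, __m256i& da, __m256i& db) {
    __m256i mask = _mm256_cmpgt_epi32(ka, kb);        // lanes where ka > kb
    __m256i kmin = _mm256_blendv_epi8(ka, kb, mask);  // per-lane min of keys
    __m256i kmax = _mm256_blendv_epi8(kb, ka, mask);  // per-lane max of keys
    __m256i dmin = _mm256_blendv_epi8(da, db, mask);  // data follows its key
    __m256i dmax = _mm256_blendv_epi8(db, da, mask);
    ka = kmin; kb = kmax; da = dmin; db = dmax;
}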

Table 3.2: The building modules to handle the data-reordering for {key,data} pairs in ASPaS

{key,data} = 32-bit,32-bit or 64-bit,64-bit:
  Transpose: key input v0,...,v(w-1), modules Transpose[w](v0,...,v(w-1)); data input v'0,...,v'(w-1), modules Transpose[w](v'0,...,v'(w-1))
  Merge: key input v,u, modules Merge_Reorder[w](v,u); data input v',u', modules Merge_Reorder[w](v',u')
{key,data} = 32-bit,64-bit:
  Transpose: key input v0,...,v(w-1), modules Transpose[w](v0,...,v(w-1)); data input v'0,v'1,...,v'(2w-1), modules Transpose[w/2](v'0,v'2,...,v'(w-2)), Transpose[w/2](v'w,v'(w+2),...,v'(2w-2)), Transpose[w/2](v'1,v'3,...,v'(w-1)), Transpose[w/2](v'(w+1),v'(w+3),...,v'(2w-1))
  Merge: key input v,u, modules Merge_Reorder[w](v,u); data input v'0,v'1,u'0,u'1, modules Merge_Reorder[w/2](v'0,u'0); Merge_Reorder[w/2](v'1,u'1)
{key,data} = 64-bit,32-bit:
  Transpose: key input v0,...,v(w-1), modules Transpose[w](v0,...,v(w-1)); data input v'0,...,v'(w-1)†, modules Transpose[w](v'0,...,v'(w-1))
  Merge: key input v,u, modules Merge_Reorder[w](v,u); data input v',u'†, modules Merge_Reorder[w](v',u')
†: On MIC, only the first halves of each vector are effective; on CPU, SSE vectors are adopted.

In the table, w represents the number of keys the built-in vector can hold. The modules are written in the format modName[count](vlist), which means generating the modName data-reordering intrinsics for the vectors in vlist, where each vector contains count elements. There are three possible combinations of keys and data: (1) When the key and data have the same length, we use exactly the same data-reordering intrinsics on the key and data vectors. (2) When the data length is double the key length, we correspondingly double the number of vectors that hold the enlarged data values. Then, the building modules are performed on halves of the input data vectors as shown in the table: for the transpose, we need four times as many intrinsics on the data vectors as on the key vectors to transpose the four blocks of data vectors, and we change the layout of the data vectors from [00, 01, 10, 11] to [00, 10, 01, 11]; for the merge, we need twice as many intrinsics on the data vectors as on the key vectors, since the input vectors are doubled. (3) When the key length exceeds the data length, we take distinct strategies according to the platforms. On the CPU, we simply use the SSE vector ISA, thanks to the backward compatibility of AVX. On MIC, since the platform does not support the previous vector ISAs, we keep the effective values in the first halves of each 512-bit vector. One may wonder why we need to reorder the data along with the keys in each step rather than only in the final step. The reason is that this alternative requires an additional "index" vector to keep track of the key movement that occurs during each step of reordering the keys. Thus, it is equivalent to our strategy, because the data in our method is the address of the real data structure.

Moreover, the reordering of data in our method uses ISA intrinsics for vectorization, which avoids irregular memory accesses. From the perspective of performance, the execution time of sorting {key,data} pairs increases compared to that of sorting the pure key array. Henceforth, we will focus on the performance analysis of sorting pure key arrays in the evaluation section.

3.5 Performance Analysis

ASPaS supports the major built-in data types, i.e., integers and single- and double-precision floating-point numbers. In our evaluation, we use Integer for the one-word type (32-bit) and Double for the two-word type (64-bit). Our codes use different ISA intrinsics according to the different platforms. Table 3.3 shows the configurations of the three platforms with Intel Ivy Bridge (IVB), Haswell (HSW), and Knights Corner (KNC), respectively. The ASPaS programs are implemented in C++11 and compiled using the Intel compiler icpc 15.3 for HSW and KNC and icpc 13.1 for IVB. On the CPUs, we use the compiler options -xavx and -xCORE-AVX2 to enable AVX and AVX2, respectively. On the MIC, we run the experiments in native mode and compile the codes with -mmic. All codes in our evaluations are optimized at the -O3 level. All the input data are generated randomly, ranging from 0 to the data size, except in § 3.5.5. This chapter focuses on the efficiency of vectorization; we show detailed performance analysis on a single thread in most sections, while § 3.5.4 evaluates the best vectorized codes in a multi-core design.

Table 3.3: Testbeds for ASPaS

Model         Intel Xeon CPU (E5-2697 v2)   Intel Xeon CPU (E5-2680 v3)   Intel Xeon Phi (5110P)
Codename      Ivy Bridge                    Haswell                       Knights Corner
Frequency     2.70 GHz                      2.50 GHz                      1.05 GHz
Cores         24                            24                            60
Threads/Core  2                             1                             4
Sockets       2                             2                             -
L1/L2/L3      32 KB/256 KB/30 MB            32 KB/256 KB/30 MB            32 KB/512 KB/-
Vector ISA    AVX                           AVX2                          IMCI
Memory        64 GB                         128 GB                        8 GB
Mem Type      DDR3                          DDR3                          GDDR5

3.5.1 Performance of Different Sorting Networks

We first test the performance of the aspas_sort and aspas_merge kernels, whose implementation depends on the input sorting and merging networks. For brevity, we only show the graphical results for the Integer datatype. We repeat the execution of the kernels 10 million times and report the total seconds in Figure 3.10. In the sort stage, ASPaS can accept any type of sorting network and generate the aspas_sort function. We use five sorting networks, including Hibbard (HI) [70], Odd-Even (OE) [21], Green (GR) [67], Bose-Nelson (BN) [27], and Bitonic (BI) [21]. In Figure 3.10, since GR cannot take 8 elements as input, its performance on the CPUs is not available. The x-axis labels also indicate how many comparators and groups of comparators each sorting network has. On the CPUs, the sorting networks have the same number of comparators except for the BI sort, thereby yielding negligible time differences, with a slight advantage to the BN sort on IVB. On the MIC, the GR sort has the best performance, which stems from its fewer comparators and groups, i.e., (60, 10). Although the BI sort follows a balanced way to compare all elements in each step and is usually considered a candidate for better performance, it uses more comparators, leading to relatively weak performance in the sort stage. Based on the results, in the remaining experiments, we choose the BN, OE, and GR sorts for the Integer datatype on IVB, HSW, and KNC, respectively. For the Double datatype, we also choose the best one, i.e., the OE sort, for the rest of the experiments.

(a) IVB (b) HSW (c) KNC

Figure 3.10: Performance comparison of aspas_sort and aspas_merge with different sorting and merging networks. The kernels are repeated 10 million times over an array of the built-in vector length, and total times are reported. The numbers of comparators and groups are given in parentheses for the sorting networks.

In the merge stage, we set the two variants of the bitonic merging network (Eq. 3.2 and Eq. 3.3 in § 3.3.3) as the input of ASPaS. Figure 3.10 also presents the performance comparison of these two variants. The inconsistent merging outperforms the consistent one by 12.3%, 20.5%, and 43.3% on IVB, HSW, and KNC, respectively. Although the consistent merging has uniform data-reordering operations in each step, as shown in Figure 3.6, the operations are not ISA-friendly and thus require a longer sequence of intrinsics. For example, based on Eq. 3.3, the consistent merging uses the $L^{32}_{2}$ data-reordering operation 5 times on MIC, each of which needs 8 permute/shuffle IMCI intrinsics. In contrast, the inconsistent merging only uses $L^{32}_{2}$ once and compensates with much lighter operations (e.g., $I_1 \otimes L^{32}_{16} \circ I_2 \otimes L^{16}_{2}$ and $I_2 \otimes L^{16}_{8} \circ I_4 \otimes L^{8}_{2}$, each of which can be implemented by an average of 2 IMCI intrinsics). On the CPUs, the $L^{16}_{2}$ operation in the consistent variant only needs 4 AVX intrinsics, leading to a smaller disparity. But, in all cases, the inconsistent bitonic merge provides the best performance. The Double datatype exhibits similar behavior. Thus, we adopt the inconsistent merging in the remaining experiments.

3.5.2 Speedups from the ASPaS Framework

In this section, we compare the ASPaS sort and merge stages with their serial counterparts. The counterparts of aspas_sort and aspas_merge are serial sorting and merging networks (one comparison and exchange at a time), respectively. Note that, in the sort stage, the aspas_transpose is not required in the serial version, since the partially sorted data can be stored directly in a consecutive manner. Ideally, the speedups from ASPaS should approximate the built-in vector width, though this is unattainable in practice because of the extra, yet required, data-reordering instructions. By default, the compiler auto-vectorizes the serial codes4, which is denoted as "compiler-vec". Besides, we explicitly turn off the auto-vectorization, which is shown as "no-vec". For the sort stages with the Integer datatype on CPUs in Figure 3.11 (a,c), the ASPaS codes deliver larger performance improvements on HSW than on IVB, since AVX on IVB does not support native integer operations as AVX2 does. Thus, we have to split the AVX vector into two SSE vectors before resorting to the SSE ISA for comparisons. For the sort stages with Double

4We also use the SIMD pragma (#pragma vector always) on the target loops.

(a) IVB (integer) (b) IVB (double)

(c) HSW (integer) (d) HSW (double)

(e) KNC (integer) (f) KNC (double)

Figure 3.11: ASPaS vs. icpc optimized (“compiler-vec”) and serial (“no-vec”) codes. For the merge stages, the lines of “compiler-vec” and “no-vec” usually overlap.

in Figure 3.11 (b,d), the ASPaS codes exhibit similar performance gains over "no-vec", achieving modest 1.1x to 1.2x speedups. The vectorization benefits for Double drop because each vector holds fewer elements than for Integer, leading to relatively higher data-reordering overhead. On KNC, the ASPaS Integer and Double sort codes in Figure 3.11 (e,f) outperform the "no-vec" counterparts by up to 10.5x and 6.0x. In addition, the ASPaS codes can also achieve better performance than the "compiler-vec" versions in most cases. By analyzing the generated assembly codes of "compiler-vec", we find that on IVB the compiler uses multiple insert instructions to construct vectors slot by slot from non-contiguous memory locations, whereas gather instructions are used on HSW and KNC. However, neither can mitigate the high latency of non-contiguous memory accesses.

The ASPaS codes, in contrast, can outperform "compiler-vec" by using load/store on contiguous data and shuffle/permute for the transpose in registers. We also observe in Figure 3.11 (d) that the "compiler-vec" version of the sort stage slows down the execution compared to "no-vec". This may stem from the fact that HSW supports vector gather but no equivalent vector scatter operations. The asymmetric load-and-store fashion on non-contiguous data with a larger memory footprint (Double) negatively impacts the performance [108]. The merge stages in Figure 3.11 on the three platforms show that the "compiler-vec" versions have similar performance to the "no-vec" versions. This demonstrates that even with the most aggressive vectorization pragma, the compiler fails to vectorize the merge codes due to the complex data dependencies within the loops.

3.5.3 Comparison to Previous SIMD Kernels

In this section, we compare our generated kernels with the manually optimized kernels proposed in previous research. These existing vector codes also focus on using vector instructions and sorting networks to sort small arrays whose sizes are multiples of the SIMD-vector length. The reasons for comparing kernels on smaller data sizes rather than arbitrarily large data sizes are as follows: (1) The kernels for sorting small arrays are usually adopted to construct efficient parallel sort algorithms in a divide-and-conquer manner (e.g., quick-sort [49, 29], merge-sort [134, 48]), where the input data is split into small chunks, each of which fits into registers, the sort kernel is applied on each chunk, and the merge kernel is called iteratively to merge chunks until there is only one chunk left. Under this circumstance, the overall performance significantly depends on the vectorization kernels [49]. (2) The major motivation of this chapter is to efficiently generate combinations of permutation instructions instead of proposing a new divide-and-conquer strategy for arbitrarily large data sizes. As a result, we compare against vector codes from Chhugani et al. (CH) [48] and Inoue et al. (IN) [83] on CPUs, while on MICs, we compare against vector codes from Xiaochen et al. (XI) [161] and Bramas (BR) [29]. The datatype in this experiment is the 32-bit integer5. We use one core (vector unit) to process randomly-generated data in segments of 8x8=64 integers for CPUs and of 16x16=256 integers for MICs, respectively. The experiments are repeated 1 million times and we report the total execution time.

5The BR paper [29] only provides AVX-512 codes for Knights Landing (KNL). Therefore, we have to port the codes using the corresponding IMCI instructions on KNC, e.g., replacing permutexvar_pd with permutevar_epi32 and the correct parameters.

Figure 3.12: ASPaS kernels vs. previous manual approaches. We repeatedly (1 million times) sort 8x8=64 integers for CPUs and 16x16=256 integers for MICs, respectively. The time for loading data from memory to registers and storing data from registers to memory is included with the sort and merge in registers.

Figure 3.12 shows the performance comparison. On CPUs, both the CH and IN methods use SSE instructions to handle intra-lane data-reordering, leading to extra instructions for processing inter-lane communications. Compared to our generated codes using AVX/AVX2 instructions, these solutions are relatively easier to implement, because they only need to process vector lanes one by one; however, there is always one unused lane for every operation, thus delivering suboptimal performance. To use the AVX/AVX2 instructions, one has to redesign the method and consider the different register lengths and corresponding instructions. In contrast, our solution automatically looks for the architecture-specific instructions to handle both intra- and inter-lane communications and delivers up to 3.4x speedups over these manual approaches. On MICs, the XI method adopts mask instructions to disable some elements for each min/max operation. These unused slots inevitably under-utilize the vector resources. The BR method, on the other hand, directly uses the expensive permutexvar instructions to conduct the global data-reordering. In contrast, our code generation framework can adapt to the underlying architectures, e.g., preferring lightweight intra-lane and swizzle instructions during code generation. Therefore, on the KNC platform, our codes provide up to 1.7x performance improvements over the manually optimized methods.

3.5.4 Comparison to Sorting from Libraries

In this section, we evaluate the single-threaded aspas::sort and the multi-threaded aspas::parallel sort by comparing them with related mergesorts and various sorting tools from existing libraries. Single-threaded ASPaS: ASPaS is essentially based on the bottom-up mergesort as its partition strategy. We first compare the single-threaded aspas::sort with two mergesort variants: top-down and bottom-up. The top-down mergesort recursively splits the input array until the split segments have only one element each; subsequently, the segments are merged together. In contrast, the bottom-up mergesort works directly on the elements of the input array and iteratively merges them into sorted segments. For their implementation, we use std::inplace merge as the kernel to conduct the actual merging operations. Figure 3.13 (a,b,c) illustrates the corresponding performance comparison on IVB, HSW, and KNC. The bottom-up mergesort slightly outperforms the top-down one due to the recursion overhead in the top-down method. ASPaS on the Integer datatype outperforms the bottom-up mergesort by 4.3x to 5.6x, while on the Double datatype it provides 3.1x to 3.8x speedups. ASPaS can efficiently vectorize the merge stage, even though the complexity of ASPaS merging is higher than that of the std::inplace merge used in the bottom-up mergesort. In ASPaS, when merging each pair of sorted segments, we fetch w elements into a buffer from each segment and then merge these 2w elements using the 2w-way bitonic merging. After that, we store the first half of the merged 2w elements back to the result, load w elements from the segment with the smaller first element into the buffer, and then start the next round of bitonic merging (ln. 18-28 in Algorithm 1). Since the 2w-way bitonic merging network contains (2w·log(2w))/2 = w·log(2w) comparators [21], the total number of comparisons for an N-element merge is (N/w) · w·log(2w) = N·log(2w). In contrast, std::inplace merge conducts exactly N-1 comparisons if enough additional memory is available. Therefore, the bottom-up mergesort performs considerably fewer comparisons than Algorithm 1. However, because our code exhibits a better memory access pattern, fetching multiple contiguous elements from memory and then conducting the comparisons in registers in a cache-friendly manner, we observe better performance

of aspas::sort over the bottom-up mergesort on all three platforms in Figure 3.13 (a,b,c).
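To make the loop structure of this merge concrete, below is a minimal scalar C++ sketch of the scheme just described (it mirrors ln. 18-28 of Algorithm 1 but is not the generated vector code); bitonic_merge_2w is a hypothetical stand-in for the vectorized 2w-way bitonic merging network, and the segment lengths are assumed to be nonzero multiples of w.

    #include <algorithm>
    #include <vector>

    // Hypothetical stand-in for the generated 2w-way bitonic merging network:
    // given buf[0..w) and buf[w..2w) each sorted, it sorts buf[0..2w).
    static void bitonic_merge_2w(int *buf, int w) { std::sort(buf, buf + 2 * w); }

    // Merge two sorted segments a[0..na) and b[0..nb) into out, w elements at a time.
    // Assumption of this sketch: na and nb are nonzero multiples of w.
    void aspas_style_merge(const int *a, int na, const int *b, int nb, int *out, int w) {
        std::vector<int> buf(2 * w);
        std::copy(a, a + w, buf.begin());       // preload w elements from each segment
        std::copy(b, b + w, buf.begin() + w);
        int ia = w, ib = w, io = 0;
        while (true) {
            bitonic_merge_2w(buf.data(), w);                        // 2w-way bitonic merge
            std::copy(buf.begin(), buf.begin() + w, out + io);      // emit the smaller half
            io += w;
            std::copy(buf.begin() + w, buf.end(), buf.begin());     // keep the larger half
            if (ia < na && (ib >= nb || a[ia] <= b[ib])) {          // reload from the segment
                std::copy(a + ia, a + ia + w, buf.begin() + w);     // whose head is smaller
                ia += w;
            } else if (ib < nb) {
                std::copy(b + ib, b + ib + w, buf.begin() + w);
                ib += w;
            } else {                                                // both inputs consumed:
                std::copy(buf.begin(), buf.begin() + w, out + io);  // the kept half is the tail
                break;
            }
        }
    }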

(a) IVB (mergesort) (b) HSW (mergesort)

(c) KNC (mergesort) (d) IVB (tbb’s sort)

(e) HSW (tbb’s sort) (f) KNC (tbb’s sort)

Figure 3.13: (a,b,c): aspas::sort vs. the top-down and bottom-up mergesorts; (d,e,f): aspas::parallel sort vs. the Intel TBB parallel sort.

Then, we compare aspas::sort with other sorting tools from widely-used libraries, including qsort and sort from STL (libstdc++.so.6.0.19), sort from Boost (v.1.55), and parallel sort from Intel TBB (v.4.1) (using a single thread). Figure 3.14 shows that the ASPaS codes provide higher performance than the other four sorts. The aspas::sort on the Integer array achieves 4.2x, 5.2x, and 5.1x speedups over qsort on

IVB, HSW, and KNC, respectively (qsort is also notorious for its function-callback overhead on every comparison). Over the other sorting tools, it can still provide up to 2.1x, 3.0x, and 2.5x speedups. For the Double datatype, the performance benefits of aspas::sort become 3.8x, 2.9x, and 3.1x speedups over qsort, and 1.8x, 1.7x, and 1.3x speedups over the others on the three platforms, respectively.

(a) IVB (general sorts) (b) HSW (general sorts)

(c) KNC (general sorts) (d) Legends

Figure 3.14: aspas::sort vs. library sorting tools.

Multi-threaded ASPaS: In Figure 3.13 (d,e,f), we compare the multi-threaded ASPaS with Intel TBB's parallel sort on larger datasets ranging from 12.5 to 400 million Integer and Double elements. We configure the thread counts as integer multiples of the number of cores and select the configuration that provides the best performance. On the three platforms, our aspas::parallel sort outperforms tbb::parallel sort by up to 2.5x, 2.3x, and 6.7x for the Integer datatype and 1.2x, 1.7x, and 5.0x for the Double datatype.

3.5.5 Sorting Different Input Patterns

Finally, we evaluate aspas::sort using different input patterns. As shown in Figure 3.15 (d), we use five input patterns defined in previous research [49]: random, even/odd, pipe organ, sorted, and push front. With these input patterns, we can further compare the performance of our generated vector codes against existing methods from widely used libraries.

(a) IVB (b) HSW


(c) KNC (d) Legends and Input Patterns

Figure 3.15: Performance of ASPaS sorting different input patterns.

In Figure 3.15, we can see that the sorting tools from modern libraries provide better performance than our generated codes for the almost-sorted inputs, i.e., "sorted" and "push front". That is because these libraries can adapt to different patterns by using multiple sorting algorithms. For example, std::sort uses a combination of quick sort and insertion sort. For an almost-sorted input array, std::sort switches from the partitioning of quick sort to insertion sort, which handles sorted input in O(n) time. In contrast, our work focuses on automatically generating efficient sorting kernels for more general cases, e.g., random, even/odd, and pipe organ. In these cases, our sorting codes yield superior performance.

3.6 Chapter Summary

In this chapter, we propose the ASPaS framework to automatically generate vectorized sorting code for x86-based multicore and manycore processors. ASPaS formalizes the sorting and merging networks as sequences of comparing and reordering operators of the DSL. Based on the characteristics of such operators, ASPaS first creates an ISA-friendly pool to contain the requisite data-comparing and data-reordering primitives, then builds those sequences with the primitives, and finally maps them to real ISA intrinsics. Besides, the ASPaS codes exhibit an efficient memory access pattern and thread-level parallelism. The ASPaS-generated codes can outperform the compiler-optimized ones and meanwhile yield the highest performance among multiple library sorting tools on the Ivy Bridge, Haswell, and Knights Corner architectures. With the emergence of the Skylake and Knights Landing architectures, our work can be easily ported to AVX-512, since the ISA subset AVX-512F contains all the permute/shuffle instructions we need for sorting. For GPUs, we will also extend ASPaS to search shuffle instructions to support fast data permutation at the register level.

Chapter 4

Data-thread Binding in Parallel Segmented Sort

4.1 Introduction

Sort is one of the most fundamental operations in computer science. A sorting algorithm orders the entries of an array by their ranks. Even though sorting algorithms have been extensively studied on various parallel platforms [133, 114, 89, 139], two recent trends necessitate revisiting them on throughput-oriented processors. The first trend is that manycore processors such as GPUs are increasingly used both for traditional HPC applications and for big data processing. In these cases, a large number of independent arrays often need to be sorted as a whole, either because of algorithm characteristics (e.g., suffix array construction in prefix doubling algorithms from bioinformatics [63, 154]), or dataset properties (e.g., sparse matrices in linear algebra [23, 150, 102, 103, 104, 101]), or real-time requests from web users (e.g., queries in data warehouses [175, 157, 187]). The second trend is that with the rapidly increasing computational power of new processors, sorting a single array at a time usually cannot fully utilize the devices; thus grouping multiple independent arrays and sorting them simultaneously is crucial for high utilization. As a result, the segmented sort that involves sorting a batch of segments of non-uniform length

concatenated in a single array becomes an important computational kernel. Although directly sorting each segment in parallel could work well on multicore CPUs with dynamic scheduling [138], applying similar methods such as "dynamic parallelism" on manycore GPUs may cause degraded performance due to the high overhead of context switching [153, 165, 60, 158]. On the other hand, the distribution of segment lengths often exhibits skewed characteristics, where a dominant number of segments are relatively short but the rest can be much longer. In this context, the existing approaches, such as the "one-size-fits-all" philosophy [122] (i.e., treating different segments equally) and some variants of global sort [53, 22] (i.e., traditional sort methods plus segment boundary checks at runtime), may not give the best performance due to load imbalance and low on-chip resource utilization. In this work, we propose a fast segmented sort mechanism on GPUs. To improve load balance and increase resource utilization, our method first constructs basic work units composed of adaptively defined elements from multiple short segments of various sizes or parts of long segments, and then uses appropriate parallel strategies for different work units. We further propose a register-based sort method to support N-to-M data-thread binding and in-register data communication. We also design a shared memory-based merge method to support merging variable-length chunks via multiple warps. For the grouped short and medium work units, our mechanism performs the segmented sort in registers and shared memory; and for long segments, our mechanism also exploits on-chip memories as much as possible. Using segments of uniform and synthetic power-law length on NVIDIA K80-Kepler and TitanX-Pascal GPUs, our segmented sort can exceed state-of-the-art methods in three vendor libraries, CUB [122], CUSP [53], and ModernGPU [22], by up to 86.1x, 16.5x, and 3.8x, respectively. Furthermore, we integrate our mechanism with two real-world applications to confirm its efficiency. For the suffix array construction (SAC) in bioinformatics, our mechanism results in a factor of 2.3-2.6 speedup over the latest skew-SA method [154] with CUDPP [69]. For the sparse matrix-matrix multiplication (SpGEMM) in linear algebra, our method delivers a factor of 1.4-86.5, 1.5-2.3, and 1.4-2.3 speedups over approaches from cuSPARSE [121], CUSP [53], and bhSPARSE [102], respectively. The contributions of this chapter are listed as follows:

• We identify the importance of segmented sort in various applications by exploring the segment length distribution in real-world datasets and uncovering performance issues of existing tools.
• We propose an adaptive segmented sort mechanism for GPUs, whose key techniques contain: (1) a differentiated method for different segment lengths to eliminate load imbalance, thread divergence, and irregular memory access; and (2) an algorithm that extends sorting networks to support N-to-M data-thread binding and thread communication at the GPU register level.
• We carry out a comprehensive evaluation at both the kernel level and the application level to demonstrate the efficacy and generality of our mechanism on two NVIDIA GPU platforms.

4.2 Motivation

4.2.1 Segmented Sort

Segmented sort (SegSort) performs a segment-by-segment sort on a given array composed of multiple segments. If there is only one segment, the operation reduces to the classical sort problem that has gained much attention over the past decades. Thus, sort can be seen as a special case of segmented sort1. The complexity of segmented sort is ∑_{i=1}^{p} n_i·log(n_i), where p is the number of segments

in the problem and n_i is the length of each segment. Figure 4.1 shows an example of segmented sort, where an array stores a list of keys (integers in this case) plus an additional array seg ptr that stores the head pointer of each segment.
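For reference, the semantics of segmented sort over the seg ptr layout can be written as a short sequential C++ sketch (a hypothetical host-side baseline, not the GPU mechanism proposed in this chapter):

    #include <algorithm>
    #include <vector>

    // Sequential reference: sort each segment [seg_ptr[i], seg_ptr[i+1]) of keys in place.
    // seg_ptr has p entries (one head pointer per segment); the last segment ends at keys.size().
    void segsort_reference(std::vector<int>& keys, const std::vector<int>& seg_ptr) {
        for (size_t i = 0; i < seg_ptr.size(); ++i) {
            int begin = seg_ptr[i];
            int end   = (i + 1 < seg_ptr.size()) ? seg_ptr[i + 1] : (int)keys.size();
            std::sort(keys.begin() + begin, keys.begin() + end);
        }
    }
    // e.g., keys = {4,1,2, 11,8, 1,6, 5}, seg_ptr = {0,3,5,7}  ->  {1,2,4, 8,11, 1,6, 5}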

seg_ptr: 0 3 5 7; input: 4 1 2 | 11 8 | 1 6 | 5  ->  output: 1 2 4 | 8 11 | 1 6 | 5
Figure 4.1: An example of segmented sort with four integer segments of various lengths pointed to by seg ptr.

1For generality, we only focus on comparison-based sort in this work.

4.2.2 Skewed Segment Length Distribution

We use real-world datasets from two applications to analyze the characteristics of data distribution in segmented sort. The first application is the suffix array construction (SAC) from bioinformatics, where the prefix doubling algorithm [63, 154] is used. This algorithm calls SegSort to sort each segment in one iteration step, and the duplicated elements form another set of segments for the next iteration. This procedure continues until no duplicated element exists. The second application is the sparse matrix-matrix multiplication (SpGEMM). In this algorithm, SegSort is used to reorder the entries in each row by their column indices.

(a) Segments from squaring three different matrices in SpGEMM

(b) Segments from the first three iterations in SAC

Figure 4.2: Histogram of segment length changes in SpGEMM and SAC.

As shown in Figure 4.2, the statistics of segments derived from these two algorithms share one feature: small/medium segments dominate the distribution, where around 96% of segments in SpGEMM and 99% of segments in SAC have fewer than 2000 elements, while the rest can be much longer but contribute less to the number of entries in the whole problem. Such highly skewed data stems from either the input data or the intermediate data generated by the algorithm at runtime. As a result, the segments of various lengths require differentiated processing methods for high efficiency. Later on, we will present an adaptive mechanism that constructs basic work units composed of multiple short segments of various sizes or parts of long segments, and processes them in a fine-grained way to achieve load balance on manycore GPUs.

4.2.3 Sorting Networks and Their Limitations

A sorting network, consisting of a sequence of independent comparisons, is usually used as a basic primitive to build the corresponding sorting algorithm. Figure 4.3a provides an example of a bitonic sorting network that accepts 8 random integers as input. Each vertical line in the sorting network represents a comparison of input elements. Through these comparisons, the 8 integers are sorted. Although a sorting network provides a view of how to compare input data, it does not give a parallel solution of how the data are distributed among multiple threads. A straightforward solution is to directly map one element of input data to one GPU thread. However, this method wastes the computational power of the GPU, because every comparison represented by a vertical line is conducted by two threads. This method also leads to poor instruction-level parallelism (ILP), since the insufficient operations per thread cannot fully take advantage of instruction pipelining. Therefore, it is important to investigate how many elements of a sorting network should be processed by a GPU thread to achieve the best performance within the GPU memory hierarchy. In this chapter, we call this the data-thread binding on GPUs.

(a) An 8-way bitonic sorting network    (b) Data-thread bindings

Figure 4.3: One sorting network and existing strategies using registers on GPUs.

Figure 4.3b presents two examples of using data-thread binding to realize a part of the sorting network shown in Figure 4.3a. Figure 4.3b-1 shows the most straightforward method of simply conducting all the computation within a single thread [22]. This option exhibits better ILP but at the expense of requesting too many register resources. On the contrary, Figure 4.3b-2 shows the example of binding one element to one thread [57]. Unfortunately, this method wastes computing resources, i.e., the first two threads perform the comparison on the same operands as the last two threads do. Therefore, we investigate a more sophisticated solution that allows N-to-M data-thread binding (N elements bound to M threads) and evaluate the performance after applying this solution to different lengths of segments on different GPU architectures. Another related but distinct issue is that even if we know the best N-to-M data-thread binding for a given segment on a GPU architecture, how to efficiently exchange data within/between threads is still challenging, especially at the GPU register level. Different from communication via the global and shared memory of the GPU, which has been studied extensively, data sharing through registers requires more research.

4.3 Methodology

4.3.1 Adaptive GPU SegSort Mechanism

The key idea of our SegSort mechanism is to construct relatively balanced work units to be consumed by a large number of warps (i.e., groups of 32 threads in NVIDIA CUDA semantics) running on GPUs. Such a work unit can be a combination of multiple segments of small sizes, or part of a long segment split by a certain interval. To prepare the construction, we first group segments of different lengths into different bins, then combine or split segments to make the balanced work units, and finally apply differentiated sort approaches at appropriate memory levels to those units. Specifically, we categorize segments into four types of bins as shown in Figure 4.4: (1) A unit bin, whose segments contain only 0 or 1 element. For these segments, we simply copy them to the output in global memory. (2) Several warp bins, whose segments are short enough. In some warp bins, a segment is processed by a single thread, while in others, a segment is handled by several threads, but at most a warp of threads. That way, we only use GPU registers to sort these segments. This register-based sort is called reg-sort (§ 4.3.2), which allows N-to-M data-thread binding and data communication between threads. Once these segments are sorted in registers, we write data from the registers to global memory, bypassing the shared memory. To achieve coalesced memory access, the sorted results may be written to global memory in a striped manner after an in-register transpose stage (§ 4.3.4). (3) Several block bins, which consist of medium-size segments. In these bins, multiple warps in a thread block cooperate to sort a segment. Besides using the reg-sort method in GPU registers, a shared memory-based merge method, called smem-merge, is designed to merge multiple sorted chunks from reg-sort within a segment (§ 4.3.3). After that, the merged results are written to the output array from the shared memory to the global memory. As shown in the figure, the number of warps in the thread block is configurable. (4) A grid bin, designed for sufficiently long segments. For these segments, multiple blocks work together to sort and merge data. Different from the block bins, multiple rounds of smem-merge have to move data back and forth between the shared memory and the global memory to process these extremely long segments. In each round of calling smem-merge, synchronization between multiple thread blocks is necessary: after the execution of reg-sort and smem-merge in each block, intermediate results need to be synchronized across all cooperative blocks via the global memory [163]. After that, the partially sorted data will be repartitioned and assigned to each block by using a partitioning method similar to that of smem-merge, and utilizing inter-block locality will in general further improve overall performance [91]. As shown in Figure 4.4, binning is the first step of our mechanism and a specific requirement of segmented sort compared to the standard sort. It is crucial to design an efficient binning approach on GPUs. Such an approach usually needs carefully designed histogram and scan kernels, such as those proposed in previous research [86, 163]. We adopt a simple and efficient "histogram-scan-bin" strategy to generate the bins across the GPU memory hierarchy. We launch a GPU kernel whose number of threads is equal to the number of segments. Each thread processes one element in the segs ptr array.
Histogram: each thread has a boolean array as predicates, and the array length is equal to the total number of bins. Then, each thread calculates its segment length from segs ptr and sets the corresponding predicate to true if the segment length falls into the current bin.


Figure 4.4: Overview of our GPU Segsort design.

After that, we use the warp vote function ballot() and the bit counting function popc() to accumulate the predicates of all threads in a warp for the warp-level histogram. Finally, the first thread of each warp atomically accumulates the bin sizes in the shared memory to produce the block-level histogram, and then the first thread of each block does the same accumulation in the global memory to generate the final histogram. Scan: we use an exclusive scan on the histogram bin sizes to get the starting positions of each bin. Binning: all threads put their corresponding segment IDs into positions atomically obtained from the scanned histogram. With such a highly efficient design, the overhead of grouping is limited and will be evaluated in § 4.4.2.
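The warp-level part of the histogram step can be sketched in CUDA as follows; this is an illustrative simplification (the per-warp counts are accumulated with a single atomicAdd here instead of the shared-memory block-level step), and the kernel name, the bin_upper_bound array, and the use of the _sync variants of ballot()/popc() are assumptions of this sketch rather than the exact implementation:

    // Sketch of the warp-level histogram step (illustrative names, not the exact kernel).
    // Each thread inspects one segment and votes for the bin its length falls into.
    __global__ void bin_histogram(const int *seg_ptr, int num_segs, int total_elems,
                                  const int *bin_upper_bound, int num_bins,
                                  int *hist /* num_bins counters in global memory */) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 31;
        int len  = 0;
        if (tid < num_segs) {
            int end = (tid + 1 < num_segs) ? seg_ptr[tid + 1] : total_elems;
            len = end - seg_ptr[tid];                 // segment length handled by this thread
        }
        for (int b = 0; b < num_bins; ++b) {
            bool pred = (tid < num_segs) &&
                        len <= bin_upper_bound[b] &&
                        (b == 0 || len > bin_upper_bound[b - 1]);   // falls into bin b?
            unsigned ballot = __ballot_sync(0xffffffffu, pred);     // warp-level vote
            if (lane == 0 && ballot)                                // one atomic per warp
                atomicAdd(&hist[b], __popc(ballot));
        }
    }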

4.3.2 Reg-sort: Register-based Sort

Our reg-sort algorithm is designed to sort data in GPU registers for all bins. As a result, our method needs to support N-to-M data-thread binding, meaning that M threads cooperate on sorting N elements. In order to leverage GPU registers to implement fast data exchange for sorting networks, M is at most 32, which is the warp size. For the medium and long segments where multiple warps are involved, we still use reg-sort to sort each chunk of a segment and use smem-merge to merge these sorted chunks. On the other hand, although our method theoretically supports any value of N, N is bounded by the number of registers: if N is too large, the occupancy degrades significantly because too many registers are used.

_exch_primitive(rg0, rg1, tmask, swbit), instantiated as _exch_primitive(rg0, rg1, 0x1, 0):
    ① _shuf_xor(rg1, 0x1);             // Shuffle data in rg1
    ② cmp_swp(rg0, rg1);               // Compare data of rg0 & rg1 locally
    ③ if( bfe(tid,0) ) swp(rg0, rg1);  // Swap data of rg0 & rg1 if bit 0 of tid is set
    ④ _shuf_xor(rg1, 0x1);             // Shuffle data in rg1

Figure 4.5: Primitive communication pattern and its implementation.

Figure 4.5 shows the data exchange primitive in the bitonic sorting network of Figure 4.3a, along with the details of its implementation on GPU registers using shuffle instructions. In this example, each thread holds two elements as input: thread 0 has 4 and 3 in its registers rg0 and rg1, and thread 1 has 2 and 1 accordingly. This situation corresponds to a 4-to-2 data-thread binding. The primitive is implemented in four steps. First, each thread needs to know its communicating thread via the parameter tmask. Line 12 of Algorithm 3 shows how to calculate its value, which is equal to the current cooperative thread group size minus 1. In this example, M is 2 because there are two threads in a cooperative group, and tmask is 1 (represented as 0x1 in the figure). This step uses the shuffle instruction shfl xor(rg1, 0x1)2 to shuffle data in rg1: thread 0 gets data from the rg1 of thread 1, and similarly for thread 1. This step changes the data from "4, 3, 2, 1" to "4, 1, 2, 3". Second, each thread compares and swaps the data in rg0 and rg1 locally, changing the data to "1, 4, 2, 3". After the second step, the rg0 of each thread holds the smaller datum and the rg1 holds the larger one. Thus, the third step is necessary to exchange the data in thread 1 so that the smaller datum is in its rg1 and the larger datum in its rg0 for the following shuffle, because the shuffle instruction can only exchange data between registers having the same variable name, i.e., rg1. In a more general case where more threads are involved, we use the parameter swbit to control which threads need to execute this local swap operation. Line 13 of Algorithm 3 shows how to calculate the value of swbit, which is equal to log(coop thrd size) − 1. In this case, the current cooperative thread group size is 2, the swbit is 0, and thread 1 does the swap when

2The shuffle operation shfl xor(v, mask) lets the calling thread x obtain register data v from thread x ^ mask, and the operation shfl(v, y) lets the calling thread x fetch data v from thread y.

bfe(1, 0) returns 1. After the third step, we get "1, 4, 3, 2". Fourth, we shuffle again on rg1 to move the larger datum of thread 0 to thread 1 and the smaller datum of thread 1 to thread 0, and then get the output "1, 2, 3, 4". Based on this primitive, we extend two patterns of the N-to-M binding to implement the bitonic sorting network: (1) exch intxn: the differences of corresponding data indices for comparisons decrease as the register indices increase in the first thread; (2) exch paral: the differences of corresponding data indices for comparisons stay constant as the register indices increase in the first thread. The left-hand sides of Figure 4.6a and Figure 4.6b show these two patterns. In these figures, each thread handles k elements, where k = N/M for the N-to-M data-thread binding. These two patterns can be easily constructed from exch primitive by simply swapping corresponding registers within each thread locally, as shown on the right-hand side of Figure 4.6a and Figure 4.6b. Different from these two patterns involving inter-thread communication, a third pattern of the N-to-M data-thread binding only has intra-thread communication. As shown in Figure 4.6c, we can directly use the compare-and-swap operation for the implementation without any shuffle-based inter-thread communication. We call this pattern exch local. This pattern can be seen as an extreme case of N-to-M binding, i.e., N-to-1, where the whole sorting network is processed by one thread. All three patterns are used in Algorithm 3 to implement a general N-to-M binding and the corresponding data communication for the bitonic sorting network. The pseudo-code is shown in Algorithm 3, which essentially combines the exch intxn and exch paral patterns (line 14 and line 28) in each iteration, and uses exch local when no inter-thread communication is needed. In each step where all comparisons can be performed in parallel, we group data elements into coop elem num groups, representing the maximum number of groups whose comparisons can be conducted without any interaction with elements in other groups (line 4 and line 16); and coop elem size represents how many elements are in each group. For example, in step 1 of Figure 4.6d, the data is put into 4 groups, each of which has 2 elements. Thus, coop elem num is 4 and coop elem size is 2. In step 4 of this figure, there is only 1 group with a size of 8 elements. Thus, coop elem num is 1 and coop elem size is 8. Similarly, the algorithm groups threads into coop thrd num and coop thrd size for each step to represent the maximum number of cooperative thread groups and the number of threads in each group.

Algorithm 3: Reg-sort: N-to-M data-thread binding and communication for bitonic sort
/* segment size N, thread number M, workloads per thread wpt = N/M, regList is a group of wpt registers. */
 1  int p  = (int)log(N);
 2  int pt = (int)log(M);
 3  for l ← p; l >= 1; l-- do
 4      int coop_elem_num  = (int)pow(2, l - 1);
 5      int coop_thrd_num  = (int)pow(2, min(pt, l - 1));
 6      int coop_elem_size = (int)pow(2, p - l + 1);
 7      int coop_thrd_size = (int)pow(2, pt - min(pt, l - 1));
 8      if coop_thrd_size == 1 then
 9          int rmask = coop_elem_size - 1;
10          exch_local(regList, rmask);
11      else
12          int tmask = coop_thrd_size - 1;
13          int swbit = (int)log(coop_thrd_size) - 1;
14          exch_intxn(regList, tmask, swbit);
15      for k ← l + 1; k <= p; k++ do
16          int coop_elem_num  = (int)pow(2, k - 1);
17          int coop_thrd_num  = (int)pow(2, min(pt, k - 1));
18          int coop_elem_size = (int)pow(2, p - k + 1);
19          int coop_thrd_size = (int)pow(2, pt - min(pt, k - 1));
20          if coop_thrd_size == 1 then
21              int rmask = coop_elem_num - 1;
22              rmask = rmask - (rmask >> 1);
23              exch_local(regList, rmask);
24          else
25              int tmask = coop_thrd_num - 1;
26              tmask = tmask - (tmask >> 1);
27              int swbit = (int)log(coop_thrd_size) - 1;
28              exch_paral(regList, tmask, swbit);

For example, step 1 of Figure 4.6d has 4 cooperative thread groups and each group has 1 thread; thus, coop thrd num is 4 and coop thrd size is 1. In contrast, in step 4, coop thrd num is 1 and coop thrd size is 4, which means the 4 threads need to communicate with each other to get the required data for the comparisons in this step. If there is only one thread in a cooperative thread group (line 8 and line 20), the algorithm switches to the local mode exch local because the thread already has all comparison operands. Once there is more than one thread in a cooperative thread group, the algorithm uses the exch intxn and exch paral patterns and calculates the corresponding tmask and swbit to determine the thread of the communication pair and the thread that needs the local data rearrangement mentioned in the previous paragraph. In step 1 of this figure, where coop thrd size is 1 and exch local is executed, the algorithm calculates rmask, which controls the local comparison on registers (line 9). In step 4, where coop elem size is 8 and coop thrd size is 4, all 8 elements will be

(a) exch intxn pattern    (b) exch paral pattern

_exch_local(rg0, rg1, …, rgk-1, rmask)

reg_sort(N=8, M=4): ① _exch_local(rg0,rg1); ② _exch_intxn(rg0,rg1,0x1,0); ③ _exch_local(rg0,rg1); ④ _exch_intxn(rg0,rg1,0x3,1); ⑤ _exch_paral(rg0,rg1,0x1,0); ⑥ _exch_local(rg0,rg1);

(c) exch local pattern    (d) An example of bitonic sorting where N = 8 and M = 4

Figure 4.6: Generic exch intxn, exch paral, and exch local patterns, shown in (a,b,c). The transformed patterns can be easily implemented by using the primitive. A code example of Algorithm 3 for N = 8 and M = 4 is shown in (d).

compared across all 4 threads. In this case, the tmask is 0x3 (line 12) and swbit is 1 (line 13). In step 5, where the exch paral pattern is used, the algorithm calculates coop elem size as 4 and coop thrd size as 2. Thus, the tmask is 0x1 and swbit is 0. Note that although our design in Algorithm 3 targets the bitonic sorting network, our ideas are also applicable to other sorting networks by swapping and padding registers based on the primitive pattern.
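A CUDA sketch of the four-step primitive of Figure 4.5, written with __shfl_xor_sync, is shown below; cmp_swp and bfe are spelled out explicitly, and the full-warp mask and the direct use of tmask as the XOR lane mask are assumptions of this sketch:

    __device__ __forceinline__ void cmp_swp(int &a, int &b) {   // keep min in a, max in b
        if (a > b) { int t = a; a = b; b = t; }
    }
    __device__ __forceinline__ bool bfe(int tid, int bit) {     // bit-field extract
        return (tid >> bit) & 1;
    }

    // Sketch of _exch_primitive(rg0, rg1, tmask, swbit) from Figure 4.5.
    __device__ void exch_primitive(int &rg0, int &rg1, int tmask, int swbit) {
        int lane = threadIdx.x & 31;
        rg1 = __shfl_xor_sync(0xffffffffu, rg1, tmask);  // 1. shuffle rg1 with the partner lane
        cmp_swp(rg0, rg1);                               // 2. local compare-and-swap
        if (bfe(lane, swbit)) {                          // 3. conditional local swap
            int t = rg0; rg0 = rg1; rg1 = t;
        }
        rg1 = __shfl_xor_sync(0xffffffffu, rg1, tmask);  // 4. shuffle rg1 back
    }

For the example of Figure 4.5 (thread 0 holds 4, 3 and thread 1 holds 2, 1), calling exch_primitive(rg0, rg1, 0x1, 0) leaves thread 0 with 1, 2 and thread 1 with 3, 4.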

4.3.3 Smem-merge: Shared Memory-based Merge

As shown in Figure 4.4, for the medium and large segments in the block bins and the grid bin, multiple warps are launched to handle one segment, and the sorted intermediate data from each warp needs to be merged. The smem-merge method is designed to merge such data in the shared memory. Our smem-merge method enables multiple warps to merge chunks having different numbers of elements. We assign the first m warps a workload of size x and the remaining warps a workload of size y, keeping the load between warps as balanced as possible. Inside each warp, we also try to keep the load balanced among the cooperating threads in the merge. We design a searching algorithm based on the MergePath algorithm [68] to find the splitting points that divide the target segments into balanced partitions for each thread.

smem_merge(5, spA, spB):   // the 1st thread merges 2 data, the 2nd thread merges 3 data
    t0 = smem[spA];
    t1 = smem[spB];
    // Repeat:
    p = (spB >= lenB) || ((spA < lenA) && (t0 <= t1));
    rg0 = p ? t0 : t1;
    if (p) t0 = smem[++spA];
    else   t1 = smem[++spB];
    // … likewise for registers rg1, rg2

Figure 4.7: An example of warp-based merge using shared memory.

Figure 4.7 shows an example of smem-merge. In this case, there are 4 sorted chunks in the shared memory belonging to one segment; they have 4, 4, 4, and 5 elements, respectively. We assume each warp has 2 threads. As a result, the first 3 warps each merge 4 elements, and each thread in these warps merges two elements; the last warp works on 5 elements, where the first thread merges 2 elements and the second thread merges 3 elements. In the figure, the splitting points spA and spB for the second thread of warp 3 (in blue) are computed by the MergePath algorithm. The right part of the figure shows the merge code executed by this thread, which processes 3 elements. Our merge method first loads the data at spA and spB into two temporary registers t0 and t1. By checking whether spA and spB are out of bounds and comparing t0 and t1, the merge algorithm selects the smaller element to fill the first result register rg0. The algorithm continues loading the next element into t0 or t1 from the corresponding chunk pointed to by spA or spB, until the assigned number of elements has been processed. After that, the merged data in registers, e.g., rg0, rg1 for the first thread, and rg0, rg1, rg2 for the second thread, is stored back to shared memory for another iteration of merging. As shown in the figure, two iterations of smem-merge are used to merge the four chunks belonging to one segment.
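The per-thread merge loop on the right-hand side of Figure 4.7 can be sketched as follows; spA/spB are the MergePath splitting points into the two chunks, lenA/lenB are the chunk end offsets in shared memory, and the bounds-guarded loads and the rg[] array are simplifications of this sketch rather than the exact kernel:

    // Sketch of the per-thread portion of smem-merge (cf. Figure 4.7).
    // Each thread merges `count` elements, starting from its MergePath splitting
    // points spA (into chunk A) and spB (into chunk B) of the shared buffer smem.
    __device__ void smem_merge_thread(const int *smem, int spA, int lenA,
                                      int spB, int lenB, int count,
                                      int *rg /* count result registers */) {
        int t0 = (spA < lenA) ? smem[spA] : 0;   // guarded loads of the two chunk heads
        int t1 = (spB < lenB) ? smem[spB] : 0;
        for (int i = 0; i < count; ++i) {
            // pick from A if B is exhausted, or if A still has data and its head is smaller
            bool p = (spB >= lenB) || ((spA < lenA) && (t0 <= t1));
            rg[i] = p ? t0 : t1;
            if (p) { ++spA; if (spA < lenA) t0 = smem[spA]; }
            else   { ++spB; if (spB < lenB) t1 = smem[spB]; }
        }
    }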

4.3.4 Other Optimizations

Load data from global memory to registers: we decouple the data load from global memory to registers from the actual segmented sort in registers. Our data load kernel uses successive threads to load contiguous data for coalesced memory access. Although we do not keep the original global indices of the input data in registers, this does not matter because the input data is unsorted and the global indices are not needed by the subsequent sort routine.

Stride=1: swap → shfl_xor(rg1, 0x1) → swap;  Stride=2: swap → shfl_xor(rg1, 0x2) → swap (input to output)

Figure 4.8: An example of in-register transpose.

Store data from registers to global memory: when the segments in the warp bins are sorted, we directly store them back to global memory without the merge stage. However, because the sorted data may be distributed across different registers of the threads in a warp, directly storing them back leads to uncoalesced memory access. When the number of elements per thread grows, the situation becomes even worse. Therefore, in addition to keeping the direct store-back method, which is labeled as orig in our evaluation, we design the striped write method, which is labeled as strd. We implement an in-register transpose method to scatter the sorted data among the registers of the threads. The transpose method starts by shuffling registers with a stride of 1, then doubles the stride to 2, and finishes after log(32) = 5 iterations, where 32 is the warp size. After that, the successive threads can write data to global memory in a cyclic manner for coalesced memory access. Figure 4.8 shows an example of using 4 threads to transpose the data in their two registers. This example shows that the number of iterations of the in-register transpose depends on the number of threads rather than on the number of elements each thread holds3. As a result, after log(4) = 2 iterations, the successive elements are scattered across these threads. In the evaluation, we will investigate the best scenarios for orig and strd, considering that the orig method has the uncoalesced memory access

3The number of elements each thread has will determine how many shuffle instructions are needed in each iteration.

problem, while the strd method has the in-register transpose overhead.
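For the two-registers-per-thread case of Figure 4.8, the in-register transpose followed by the striped write can be sketched in CUDA as below; the loop generalizes the stride-1/stride-2 steps of the figure to a full warp, and the function name and the assumption that thread lane initially holds sorted elements 2*lane and 2*lane+1 are specific to this sketch:

    // Sketch: in-register "transpose" for two registers per thread (cf. Figure 4.8),
    // so that the subsequent striped store is coalesced. After the loop, thread `lane`
    // holds the elements destined for output positions `lane` and `lane + 32`.
    __device__ void transpose2_and_store(int rg0, int rg1, int *out /* 64 ints */) {
        int lane = threadIdx.x & 31;
        for (int stride = 1; stride < 32; stride <<= 1) {
            if (lane & stride) { int t = rg0; rg0 = rg1; rg1 = t; }  // swap
            rg1 = __shfl_xor_sync(0xffffffffu, rg1, stride);         // exchange
            if (lane & stride) { int t = rg0; rg0 = rg1; rg1 = t; }  // swap back
        }
        out[lane]      = rg0;   // striped (cyclic) write: consecutive lanes touch
        out[lane + 32] = rg1;   // consecutive addresses -> coalesced transactions
    }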

4.4 Performance Results

We conduct the experiments on two generations of NVIDIA GPUs: K80-Kepler and TitanX-Pascal. Table 4.1 lists the specifications of the two platforms. The input dataset consists of two arrays holding keys and values separately. The total dataset size is fixed at 2^28 pairs, and the number of segments is varied according to the target segment sizes. We report the throughput of the experiments as 2^28/t pairs/s, where t is the execution time.

Table 4.1: Experiment Testbeds

                           Tesla K80 (Kepler-GK210)   TitanX (Pascal-GP102)
Cores                      2496 @ 824 MHz             3584 @ 1531 MHz
Register/L1/LDS per core   256/16/48 KB               256/16/48 KB
Global memory              12 GB @ 240 GB/s           12 GB @ 480 GB/s
Software                   CUDA 7.5                   CUDA 8.0

4.4.1 Kernel Performance

For the reg-sort kernels, we alternate the thread group size M among {2, 4, 8, 16, 32}. At the same time, each thread varies its bound data N/M among {1, 2, 4, 8, 16} (N/M is labeled as pairs per thread, ppt, in the figures). We use a maximum of 16 bound pairs because we observe that performance deteriorates noticeably for higher numbers. Thus, the target segment size N ranges from 2 (i.e., 2 threads each binding 1 pair) to 512 (i.e., 32 threads each binding 16 pairs). Since reg-sort might exhaust register resources, we also vary the block size among 64, 128, 256, and 512 to maximize occupancy. Figure 4.9 shows the diversified performance numbers of the reg-sort kernels, which demonstrates that choosing a single solution for all segments would lead to suboptimal performance, even among GPUs from the same vendor. In Figure 4.9, we notice that the impact of data-thread binding policies varies widely depending on the GPU device. For example, when the segment size N equals 16, the possible candidates are 2(threads):8(ppt), 4:4, 8:2, and 16:1. On the K80-Kepler GPU, the highest performance is given by 8:2, achieving a 30% speedup over the slowest policy of 2:8. In contrast, the TitanX-Pascal GPU shows very similar performance for these policies. This insensitivity to register resources exhibited on the Pascal architecture can be attributed to its larger available register file per thread.

(a) K80-Kepler    (b) TitanX-Pascal

Figure 4.9: Performance of reg-sort routines with different combinations of data-thread binding policies, write methods, and block sizes.

Note that the policy of each thread binding only 1 pair is equivalent to the method proposed in [57]. This method actually wastes computing resources, because each comparison over two operands is conducted twice by two threads, resulting in suboptimal performance. The striped write method (strd) in the reg-sort kernels is particularly effective when each thread binds more than 4 pairs. This is because when the ppt is small (<4), consecutive threads can still access almost consecutive memory locations for coalesced memory transactions. However, a larger ppt means the threads' access locations are highly scattered, causing inefficient memory transactions instead. In this case, the striped write method is of great necessity. On the other hand, the block size also affects performance, and the optimum is usually achieved with 128 or 256 threads. For the smem-merge kernels, Figure 4.10 shows the performance of changing the number of cooperative warps and the ppt. The warp numbers are in {2, 4, 8, 16}, while the ppt varies among {2, 4, 8, 16}. In this scenario, the target segment size N ranges from 128 (i.e., 2 warps each merging 32x2 pairs) to 4096 (i.e., 16 warps each merging 32x8 pairs). Different from the reg-sort kernels, the best data-thread binding policies are more consistent for smem-merge across the two devices. For example, to merge a 1024-length segment, both devices prefer to use 8 warps with 4 ppt. Considering memory access, the striped method provides performance similar to the original method, which directly exploits random access of shared memory in order to achieve coalesced transactions on global memory. Note that in our implementation, we carefully select the shared memory dimensions and sizes to avoid bank conflicts. Now, we can select the best kernels for segments whose lengths are powers of two. Other segments can be handled by directly padding them to the nearest upper power of two in the reg-sort kernels. For the smem-merge kernels, we use different ppt for different warps to minimize the padded values. Since our method is based on pre-defined sorting networks, the characteristics of the input datasets have a negligible effect on the selection of the best kernels. Moreover, for each GPU, we only need to conduct the offline selection once. The results show that we only need 13 bins to handle all the segments. The first 8 and 9 bins use reg-sort to handle up to 256 pairs on the Kepler GPU and

512 pairs on the Pascal GPU, respectively. Then, segments shorter than 2048 can be handled by smem-merge kernels using 3 and 2 bins on the two platforms. Finally, our grid-bin kernels can efficiently sort the segments longer than 2048, which are all assigned to the last bin.

(a) K80-Kepler    (b) TitanX-Pascal

Figure 4.10: Performance of smem-merge routines with different combinations of data-thread binding policies and write methods. The results for 16 warps with 16 ppt are not available due to exhaustion of shared memory resources.

4.4.2 Segmented Sort Performance

We compare the performance of our best kernels with three existing tools: (1) cusp-segsort from the CUSP [53] library, which extends the segment pointer seg ptr to form another layer of primary keys, attaches it to the input keys and values, and performs the global sort from the Thrust library [71] on the new data structure; (2) cub-segsort [122], which assigns each block to sort one segment and uses a radix-sort-based scheme; and (3) mgpu-segsort [22], which evolves from the global merge sort with runtime segment-delimiter checking and thus ensures the sort only occurs within segments. Datasets of uniform distribution: We test a batch of uniform segments to evaluate the performance of our segsort kernels, which is plotted in Figure 4.11. Since cusp-segsort and mgpu-segsort are designed from the global sort, their performance is determined by the total input size. This "one-size-fits-all" philosophy ignores the characteristics of the segments and thus shows plateaued performance. For the short segments (<256 on Kepler and <512 on Pascal), our segsort achieves an average of 13.4x and 3.1x speedups over cusp-segsort and mgpu-segsort, respectively, on Kepler. These speedups rise to 15.5x and 3.2x on Pascal. For the other segments, our segsort provides an average of 5.6x and 1.2x improvements on Kepler (10.0x and 2.1x on Pascal). The performance improvements come mainly from our reg-sort and smem-merge, which minimize shared/global memory transactions.

Legend: cub_segsort, cusp_segsort, mgpu_segsort, segsort (this work); axes: Throughput (pairs/s) vs. Segment Size.
(a) K80-Kepler: lhs is 32-bit values; rhs is 64-bit values    (b) TitanX-Pascal: lhs is 32-bit values; rhs is 64-bit values

Figure 4.11: Performance of different segsort over segments with uniform distribution.

In contrast, although cub-segsort conducts a more "real" segmented sort, with each block working on one segment, this strategy falls short when the segments are numerous and short. The maximum number of segments that can be processed in parallel by cub-segsort is only 65535 (the limit of gridDim.x), which requires multiple rounds of kernel calls if the segment number is larger. Furthermore, assigning a block to handle one segment may waste computing resources severely, especially when the segments are very short. Thus, our segsort achieves an average of 211.2x and 30.4x speedups on the Kepler and Pascal devices, respectively. As the segment size increases, we can still keep 3.7x and 3.2x average improvements on the two platforms. Note that the staircase-like performance of our segsort is caused by the padding used in our method. Datasets of power-law distribution: In this test, we use a collection of synthetic power-law data. Since the segment lengths are non-uniform, we include the binning overhead and use the overall wall time for our segsort. The data generation tool is from PowerGraph [66], and its generated samples follow a Zipf distribution [7]. The relation P(l) ∝ l^(-α) states that the probability of a segment having length l is proportional to l^(-α), where α is a positive number. This implies that a higher α results in higher skewness toward shorter segments. For different segment bounds, we vary the skewness to test our segsort. We vary α from 0.1 to 1.6 (with a stride of 0.1) and limit the maximum segment size from 50 to 2000 (with a stride of 50). Therefore, each method is tested with 640 sampling points of different parameter configurations. Figure 4.12 plots the speedups of our segsort over the existing tools on both 32-bit and 64-bit values. For cusp-segsort and mgpu-segsort, we fix the total number of key-value pairs at 2^28. However, for cub-segsort, we set the segment number to 65535; otherwise, multiple rounds of calls are required. Figure 4.12(a,b) show the speedups of our segsort over cub-segsort. Because of the high cost of radix sort on short segments, our segsort provides significant speedups, reaching up to 63.3x and 86.1x (top-left corner). For longer segment lengths with power-law distribution, our method keeps 1.9x speedups on both GPU devices due to the better load balancing strategy. Compared to cusp-segsort in Figure 4.12(c,d), our segsort achieves up to 12.5x and 16.5x improvements on Kepler and Pascal, and compared with mgpu-segsort in Figure 4.12(e,f), our method gets up to 3.0x and 3.8x performance gains. In both situations, we notice that the top-left corners are blue, indicating smaller speedups (vs. cusp-segsort) or similar performance (vs. mgpu-segsort). This is because the condition of α = 1.6 and a segment bound of 50 makes almost all segments have length 1. Thus, the segmented sorts become a copying procedure and exhibit similar performance. On the other hand, we observe that Pascal shows higher performance benefits than Kepler. The reasons are two-fold: faster atomic operations for binning and larger register files for sorting. Moreover, for the 64-bit values, we usually get higher performance gains compared to 32-bit values. This is mainly because we can handle the permutation indices more efficiently in the registers rather than in the shared memory or global memory. To evaluate the binning overhead, we calculate the arithmetic mean of the ratio of binning to overall kernel time, which is only 5.9% on Kepler and 3.4% on Pascal.
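To make the power-law setup concrete, the following host-side C++ sketch draws segment lengths from a truncated Zipf distribution with P(l) ∝ l^(-α); it is only illustrative and is not the PowerGraph generator used in the evaluation:

    #include <cmath>
    #include <random>
    #include <vector>

    // Draw `count` segment lengths in [1, max_len] with P(l) proportional to l^(-alpha).
    std::vector<int> zipf_segment_lengths(int count, int max_len, double alpha,
                                          unsigned seed = 42) {
        std::vector<double> weights(max_len);
        for (int l = 1; l <= max_len; ++l)
            weights[l - 1] = std::pow((double)l, -alpha);   // unnormalized P(l)
        std::discrete_distribution<int> dist(weights.begin(), weights.end());
        std::mt19937 gen(seed);
        std::vector<int> lengths(count);
        for (int i = 0; i < count; ++i)
            lengths[i] = dist(gen) + 1;                     // shift from index to length
        return lengths;
    }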

4.5 SegSort in Real-World Applications

In order to further evaluate our approach, we choose two real-world applications characterized by skewed segment distributions: (i) the suffix array construction, used to solve problems of pattern matching, data compression, etc., in text processing and bioinformatics [154, 63]; and (ii) the sparse general matrix-matrix multiplication (SpGEMM), used to solve graph and linear solver problems, e.g., sub-graphs, shortest paths, and algebraic multigrid [148, 30, 23]. We use our approach to optimize these applications and compare the results with state-of-the-art tools, i.e., skew/DC3-SA [154, 69], ESC(CUSP) [23, 53], cuSPARSE [121], and bhSPARSE [102].

4.5.1 Suffix Array Construction

The suffix array stores the lexicographically sorted indices of all suffixes of a given sequence. Our approach to suffix array construction is based on the prefix doubling algorithm [109] with a computational complexity of O(N·log(N)), where N is the length of the input sequence. The main idea is that we can deduce the order of two 2h-size strings Si and Sj if the orders of all h-size strings are known, which are stored in an unfinished suffix array h-SA. For example, we treat

Si as two concatenated h-size prefixes Sia and Sib. Similarly, Sj is split into Sja and Sjb.

Then, by looking up the known h-SA, the comparison rule for Si and Sj becomes: if the prefix Sia differs from Sja, we can directly determine the order of Si and Sj accordingly; otherwise, we need to check the order of Sib and Sjb. That way, if they are different, the order of Si and Sj can also be

(a) K80-Kepler: speedups over cub segsort (32- and 64-bit values)    (b) TitanX-Pascal: speedups over cub segsort (32- and 64-bit values)
(c) K80-Kepler: speedups over cusp segsort (32- and 64-bit values)    (d) TitanX-Pascal: speedups over cusp segsort (32- and 64-bit values)
(e) K80-Kepler: speedups over mgpu segsort (32- and 64-bit values)    (f) TitanX-Pascal: speedups over mgpu segsort (32- and 64-bit values)
(axes: alpha vs. max seg size)

Figure 4.12: Segmented sort vs. existing tools over segments of power-law distribution.

induced. However, if they are the same, we have to mark Si and Sj as unsolved and put them in the same category (i.e., updating h-SA to 2h-SA with positions i and j storing the same value). The ordering proceeds for O(log(N)) iterations by doubling the prefix length. This is a segmented sort with a customized comparison, where a segment corresponds to a category that contains a group of suffixes whose orders are not yet determined. We use our method to optimize the prefix doubling, i.e., PDSS-SA, and use the baseline from the cuDPP library [69], which is based on the DC3/skew algorithm on GPUs [154]. Figure 4.13 presents the performance comparison over six DNA sequences of different lengths from the NCBI NR datasets [115]. Our PDSS-SA provides up to 2.2x and 2.6x speedups over the baseline on the K80-Kepler and TitanX-Pascal platforms, respectively. The improvement can be explained in two ways. First, although the baseline is a linear-time algorithm, prefix doubling is easier to parallelize, e.g., via global radix sort, prefix sum, etc. Second, for the dominant kernels (taking 30% to 60% of the total time), we use our efficient segmented sort to handle the considerable amounts of short segments iteratively, while the baseline uses more expensive sort and merge kernels recursively.
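The comparison rule above amounts to comparing rank pairs; a hedged C++ sketch is given below, where the rank array plays the role of the h-SA (rank[x] is the order of the h-length prefix starting at position x) and the handling of out-of-range positions is an assumption of this sketch:

    // Compare suffixes S_i and S_j by their 2h-length prefixes, given the h-order `rank`.
    // rank[x] encodes the order of the h-length prefix at position x (equal prefixes share
    // a rank); positions past the end of the sequence of length n are treated as smallest.
    inline bool less_2h(int i, int j, int h, int n, const int *rank) {
        if (rank[i] != rank[j])                    // first halves S_ia vs. S_ja differ
            return rank[i] < rank[j];
        int ri = (i + h < n) ? rank[i + h] : -1;   // second halves S_ib vs. S_jb
        int rj = (j + h < n) ? rank[j + h] : -1;
        return ri < rj;                            // equal ranks mean the order is unresolved
    }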

(a) K80-Kepler    (b) TitanX-Pascal

Figure 4.13: Performance of suffix array construction using our segmented sort.

4.5.2 Sparse Matrix-Matrix Multiplication

The SpGEMM operation multiplies a sparse matrix A with another sparse matrix B and obtains a resulting sparse matrix C. This operation may be the most complex routine in the sparse basic linear algebra subprograms because all three involved matrices are sparse. The expansion, sorting and compression (ESC) algorithm developed by Bell et al. [23] is one of several representative methods that aim to utilize wide SIMD units for accelerating SpGEMM on GPUs [23, 121, 102]. The ESC method includes three stages: (1) expanding all candidate nonzero entries generated by the necessary arithmetic operations into an intermediate sparse matrix, (2) sorting the intermediate matrix by its row and column indices, and (3) compressing the intermediate matrix into the resulting matrix C by fusing entries with duplicate column indices in each row. We use the ESC SpGEMM in the CUSP library [53] as the baseline and replace the original sort in the second stage of its ESC implementation with our segmented sort, by mapping the row-column information of the intermediate matrix to segment-element data in the context of segmented sort. Since we use segmented sort instead of sort in the ESC method, we call our method ESSC (expansion, segmented sorting and compression). Note that to better understand the effectiveness of segmented sort for SpGEMM, all other stages of the SpGEMM code remain unchanged. We select six widely used sparse matrices, cit-Patents, email-Enron, rajat22, webbase-1M, web-Google, and web-NotreDame, from the University of Florida Sparse Matrix Collection [56] as our benchmark suite. Squaring those matrices generates intermediate matrices whose rows follow a power-law distribution. We also include two state-of-the-art SpGEMM methods, cuSPARSE [121] and bhSPARSE [102], in our comparison. Figure 4.14 plots the performance of the four participating methods. It can be seen that our ESSC method achieves up to 2.3x and 1.9x speedups over the ESC method on the Kepler and Pascal architectures, respectively. The performance gain comes from the fact that the sorting stage takes from 45% up to 85% of the overall cost of ESC SpGEMM, and our segmented sort is 3.2x to 8x faster than the sort method used in CUSP. Compared with the latest libraries cuSPARSE and bhSPARSE, our ESSC method brings performance improvements of up to 86.5x and 2.3x, respectively. The performance gain is mainly due to the irregularity of the row distribution of C.
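The mapping from the intermediate matrix to a segmented-sort instance is straightforward once the expanded entries are grouped by row; the following C++ sketch (with hypothetical CSR-style inputs) shows how the segsort arrays could be assembled:

    #include <vector>

    // Map the expanded intermediate matrix (CSR-like row offsets, column indices, values)
    // to a segmented-sort instance: one segment per row, keys = column indices,
    // values = numeric entries. row_offsets has num_rows + 1 entries.
    struct SegSortInput {
        std::vector<int>    seg_ptr;   // head pointer of each segment (one per row)
        std::vector<int>    keys;      // column indices, to be sorted within each row
        std::vector<double> vals;      // matrix values carried along as payload
    };

    SegSortInput map_intermediate_to_segsort(const std::vector<int>& row_offsets,
                                             const std::vector<int>& col_indices,
                                             const std::vector<double>& values) {
        SegSortInput in;
        in.seg_ptr.assign(row_offsets.begin(), row_offsets.end() - 1); // drop trailing end
        in.keys = col_indices;    // sorted per segment by the segmented sort
        in.vals = values;         // permuted alongside the keys
        return in;
    }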

(a) K80-Kepler    (b) TitanX-Pascal

Figure 4.14: Performance of SpGEMM using our segmented sort.

4.6 Chapter Summary

In this chapter, we have presented an efficient segmented sort mechanism that adaptively combines or splits segments of different sizes for load-balanced processing on GPUs, and have proposed a register-based sort algorithm with N-to-M data-thread binding and in-register communication for fast sorting networks across multiple memory hierarchies. The experimental results illustrate that our mechanism is significantly faster than the existing segmented sort methods in vendor-supported libraries on two generations of GPUs. Furthermore, our approach improves the overall performance of the applications SAC from bioinformatics and SpGEMM from linear algebra.

Chapter 5

SIMD Operations in Sequence Alignment

5.1 Introduction

Pairwise sequence alignment algorithms, e.g., Smith-Waterman (SW) [140] and Needleman-Wunsch (NW) [116], are important computing kernels in bioinformatics applications ([132, 142, 105]) to quantify the similarity between pairs of DNA, RNA, or protein sequences. This similarity is captured by a matching score, which indicates the minimum number of deletion, insertion, or substitution operations, with penalty or award values, needed to transform one sequence into another. To boost their performance on modern multi- and many-core processors, it is crucial to utilize the vector processing units (VPUs), which essentially conduct single instruction, multiple data (SIMD) operations. However, the strong data dependencies among neighboring cells prevent such algorithms from taking advantage of compiler auto-vectorization. Thus, programmers need to explicitly vectorize their code or even resort to writing assembly code to attain better performance. The manual vectorization of such algorithms often relies on two strategies: (1) iterate [62, 142, 105]: partially ignore the dependencies in one direction, vectorize the computations, and possibly compensate the results by using multiple rounds of corrections; (2) scan [87]: completely ignore the dependencies in one direction, vectorize the computations, and recalculate the results with "weighted scan" operations and another round of correction. Either strategy has its own benefits


depending on the selected algorithms (e.g., SW or NW), gap systems (linear or affine), and input sequences.

There are two main challenges facing programmers. First, manual vectorization requires huge coding efforts to handle the idiosyncratic vector instructions. For applications having complex data dependencies, expert knowledge of vector instruction sets and proficient skills in organizing vector instructions are necessary to achieve the desired functionality. Moreover, current vector ISAs evolve very fast and some versions are not backwards compatible [127]. Porting existing vectorized codes to other platforms becomes a tedious and error-prone task. Second, even highly optimized vector codes may not deliver optimal performance at the application level. For pairwise sequence alignment, the combinations of algorithms, vectorization strategies, configurations (gap penalty systems), and input sequences at runtime may lead to significantly variable performance. This increases the complexity of optimizing applications on modern multi- and many-core processors. Therefore, looking for a way to get around these obstacles is of great importance.

In this chapter, we propose a framework, AAlign, to automatically vectorize pairwise sequence alignment algorithms across ISAs. Our framework takes sequential algorithms, which need to follow our generalized paradigm for pairwise sequence alignment, as the input and generates vectorized computing kernels as the output by using formalized vector code constructs and linking to platform-specific vector primitives. Two vectorizing strategies are formalized as striped-iterate and striped-scan in our framework. In addition, a hybrid mechanism is introduced to take advantage of both of them: it can automatically switch between striped-iterate and striped-scan based on the runtime context, and thus provide better performance than the basic mechanisms.

The major contributions of our work include the following. First, we propose the AAlign framework that can automatically generate parallel codes for pairwise sequence alignment with combinations of algorithms, vectorizing strategies, and configurations. Second, we identify that the existing vectorizing strategies cannot always provide optimal performance even when the codes are highly vectorized and optimized. As a result, we design a hybrid mechanism to take advantage of both vectorizing strategies. Third, using AAlign, we generate various parallel codes for the combinations of algorithms (SW and NW), vectorizing strategies (striped-iterate, striped-scan, and hybrid), and configurations (linear and affine gap penalty systems) on two x86-based platforms, i.e., a multicore CPU supporting the Advanced Vector eXtensions (AVX2) and a manycore coprocessor supporting the Initial Many Core Instructions (IMCI).

We conduct a series of evaluations of the generated vector codes. Compared to the optimized sequential codes on a Haswell CPU, our codes using striped-scan can deliver 4 to 6.2-fold speedups, while using striped-iterate, our codes can provide 4.7 to 10-fold speedups. The vector codes continue to show performance advantages on the Intel MIC, achieving 9.1 to 16-fold speedups using striped-scan and 9.5 to 25.9-fold speedups using striped-iterate over the optimized sequential counterparts.
We also compare the proposed hybrid mechanism with the striped-iterate and striped-scan mechanisms, and demonstrate that the hybrid mechanism can achieve better performance on both platforms. After wrapping our vector codes with multi-threading, we compare our codes using the hybrid vectorizing strategy with the highly optimized sequence alignment tools SWPS3 [142] on the CPU and SWAPHI [105] on the MIC. While aligning the given query sequences to a whole database, our codes can achieve up to a 2.5-fold speedup over SWPS3 on the CPU and a 1.6-fold speedup over SWAPHI on the MIC.

5.2 Motivation and Challenges

This section provides a brief overview of the pairwise sequence alignment algorithms and the motivation for this work.

5.2.1 Pairwise Sequence Alignment Algorithms

Pairwise sequence alignment quantifies the best-matching score between piecewise or whole regions of two input sequences of DNA, RNA, or protein. Specifically, the alignment uses the edit distance to describe how to transform one sequence into another by using a minimum number of predefined operations, including insertion, deletion, and substitution, with associated penalty or award values. One common technique is dynamic programming using the tabular computation shown

in Figure 5.1. If the input sequences are a query Qm with m characters and a subject Sn with n characters, we need an m × n table T, and every cell Ti,j in the table stores the optimal score of matching the substrings Qi and Sj. To assist in the computation, we define three additional tables:

Li,j, Ui,j, and Di,j, denoting the optimal scores of matching the substrings Qi and Sj but ending with an insertion, deletion, or substitution, respectively. We can derive:

Ti,j = max(Li,j,Ui,j,Di,j) (5.1)

Figure 5.1 also shows the data dependencies. Visually, Li,j, Ui,j, and Di,j rely on their left, upper, and diagonal neighbors. Although the algorithm takes O(m ∗ n) time and space, by using the double-buffering technique shown in the two solid rectangles of the figure, we can lower the space complexity to O(m), assuming the computation proceeds along Qm.

(a) double-buffering  (b) Horizontal Dep.

Figure 5.1: Data dependencies in the alignment algorithms using dynamic programming.

There are two major classes of pairwise sequence alignment algorithms, i.e., local and global alignment. For global alignment, the Needleman-Wunsch algorithm [116] finds the best-matching score for the entire sequences. For local alignment, the Smith-Waterman algorithm [140] quantifies the optimal score over partial regions. Both algorithms have multiple variants using linear or affine gap penalties. We will show the generalized paradigm for the pairwise sequence alignment algorithms in § 5.3.

5.2.2 Challenges

Algorithm 4 shows the sequential code of SW with the affine gap penalty system. Though writing the sequential code is relatively simple, vectorizing such an algorithm is nontrivial due to the strong data dependencies among the neighbors shown in Figure 5.1.

Algorithm 4: Sequential Smith-Waterman following the paradigm (§ 5.3)

/* GAPOPEN and GAPEXT are constants; BLOSUM62 is a substitution matrix; ctoi is a user-defined function to map a given character to its index number in the substitution matrix */
1 for i ← 0; i ...
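Since only the header of Algorithm 4 is reproduced here, the following is a minimal scalar sketch of Smith-Waterman with an affine gap system in the spirit of the paradigm in § 5.3. It is not the exact listing of Algorithm 4; the loop structure, and the assumption that GAPOPEN/GAPEXT are negative penalty constants, are ours.

```cpp
// Hedged sketch: scalar Smith-Waterman with an affine gap system, using
// double buffering along the query Q. GAPOPEN/GAPEXT are assumed to be
// negative penalty constants (they are added to the scores); blosum62 and
// ctoi mirror the names mentioned in the comment of Algorithm 4.
#include <algorithm>
#include <limits>
#include <string>
#include <utility>
#include <vector>

int sw_affine(const std::string& Q, const std::string& S,
              const int blosum62[24][24], int (*ctoi)(char),
              int GAPOPEN, int GAPEXT) {
    const int m = static_cast<int>(Q.size());
    std::vector<int> T(m + 1, 0), Tprev(m + 1, 0);  // best scores: current / previous column
    std::vector<int> U(m + 1, 0), Uprev(m + 1, 0);  // best scores ending with a gap in Q
    const int NEG = std::numeric_limits<int>::min() / 2;
    int best = 0;
    for (char sc : S) {                  // one column of the table per subject character
        std::swap(T, Tprev);
        std::swap(U, Uprev);
        T[0] = U[0] = 0;
        int L = NEG;                     // best score ending with a gap in S (within this column)
        for (int i = 1; i <= m; ++i) {
            U[i]  = std::max(Uprev[i] + GAPEXT, Tprev[i] + GAPOPEN + GAPEXT);
            L     = std::max(L + GAPEXT, T[i - 1] + GAPOPEN + GAPEXT);
            int D = Tprev[i - 1] + blosum62[ctoi(Q[i - 1])][ctoi(sc)];
            T[i]  = std::max({0, U[i], L, D});   // the 0 keeps the alignment local (SW)
            best  = std::max(best, T[i]);
        }
    }
    return best;
}
```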

We have already introduced the two vectorizing strategies to re-construct the data dependencies in § 5.1. We describe their major differences in this section: (1) iterate [62] partially ignores the vertical dependencies in Figure 5.1 and processes the vertical cells simultaneously along the column. This round of computation only ensures that a part of the results is correct, leading to potentially multiple rounds of corrections. (2) scan [87], originally designed for GPUs, completely ignores the vertical dependencies at the beginning. The vertical cells can be processed in a SIMD way, giving us the preliminary results. After that, a parallel max-scan operation is conducted on the preliminary results, and the scan results are applied to correct the results in another round of computation. The fundamental difference between these two strategies lies in the correction: iterate may not need any correction, or finishes the correction within one or several steps of re-computation once it reaches convergence, while scan always takes two rounds of re-computation, i.e., the scan on all vertical cells and then a round of much lighter correction. Comparing the vector codes in Algorithm 5 and Algorithm 6¹ to the sequential code in Algorithm 4, we can see that writing vector codes involves expert knowledge of the algorithms and the

platform-specific ISAs, even though the detailed low-level intrinsics are hidden by our formalized codes. As a result, the first question we want to answer is whether we can automatically vectorize these types of applications with multiple combinations of parameters.

¹Although we use our formalized codes as the examples, the hand-written vector codes presented in previous research, e.g., [62], are similar to ours.

         


Figure 5.2: Example of comparing two vectorizing strategies under various conditions on MIC (the cases are from § 5.5).

On the other hand, the differences between the two strategies indicate that each has its own benefits. Figure 5.2 takes some evaluation numbers from § 5.5 to show another motivation: because the algorithms, configurations, and input sequences at runtime can affect the performance, and no single combination can always provide the best performance, the second question in this chapter is whether we can design a mechanism to automatically select the favorable vectorization strategy at runtime.

5.3 Generalized Pairwise Alignment Paradigm

In this section, we present our generalized paradigm for the pairwise sequence alignment algorithms with adjustable gap penalties. Any sequential code following the paradigm can be processed by our framework to generate real vector codes.

$$T_{i,j} = \max\Big\{\; 0\ (\text{optional}),\;\; \max_{0 \le l < j}\big(T_{i,l} + \theta_{i,l} + \textstyle\sum_{k=l+1}^{j}\beta_{i,k}\big),\;\; \max_{0 \le l < i}\big(T_{l,j} + \theta_{l,j} + \textstyle\sum_{k=l+1}^{i}\beta_{k,j}\big),\;\; T_{i-1,j-1} + \gamma_{i,j} \;\Big\} \quad (2)$$

In the paradigm in Equation (2), T is the working-set table and Ti,j stores the suboptimal score. The 0 is optional and used only in local alignment. θi,l (θl,j) is the gap penalty of initiating a gap at position l of Qm (Sn). βi,k (βk,j) is the gap penalty of continuing a gap at position k of Qm

(Sn). γi,j is the substitution score of matching base j of Qm and base i of Sn. In bioinformatics, the substitution scores usually come from a scoring matrix, such as BLOSUM62. Both θi,l (θl,j) and βi,k (βk,j) can be configured to be either constants or variables. By using dynamic programming, one can use three auxiliary symbols, i.e., Ui,j, Li,j, Di,j, to represent the influence from

Ti,j's upper, left, and diagonal neighbors. Therefore, the paradigm is equivalent to Equations (3)-(6).

$$T_{i,j} = \max\big(0\ [\text{optional}],\; U_{i,j},\; L_{i,j},\; D_{i,j}\big) \quad (3)$$
$$U_{i,j} = \max\big(U_{i,j-1} + \beta_{i,j},\; T_{i,j-1} + \theta_{i,j-1} + \beta_{i,j}\big) \quad (4)$$
$$L_{i,j} = \max\big(L_{i-1,j} + \beta_{i,j},\; T_{i-1,j} + \theta_{i-1,j} + \beta_{i,j}\big) \quad (5)$$
$$D_{i,j} = T_{i-1,j-1} + \gamma_{i,j} \quad (6)$$

Now, we can fit the real algorithms into the paradigm. Smith-Waterman: Because it is a local alignment algorithm, we need to keep the 0 as the initial value. If we simply use a linear gap penalty, θi,l (θl,j) is set to 0 and βi,k (βk,j) is the gap penalty value. If we use an affine gap penalty, θi,l (θl,j) is the gap-open penalty value and βi,k (βk,j) is the gap-extension penalty value. If these parameters are variables, other gap penalty systems can be used. Needleman-Wunsch: Because it is a global alignment algorithm, we do not need the 0. The configuration of the other parameters is similar to that of SW. Actually, lines 7 to 10 in Algorithm 4 follow the paradigm, with the necessary initialization codes in lines 1 to 4.
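To make the parameter choices concrete, the hedged sketch below encodes the combinations above as a small configuration record. The struct and function names are ours (illustrative only), not part of AAlign.

```cpp
// Hedged sketch: how the paradigm parameters specialize for SW/NW with
// linear or affine gap penalties. Names are illustrative, not AAlign APIs.
struct ParadigmConfig {
    bool keep_zero;   // true for local alignment (SW); false for global (NW)
    int  theta;       // gap-open penalty (0 under a linear gap system)
    int  beta;        // per-position gap (extension) penalty
};

ParadigmConfig make_config(bool smith_waterman, bool affine,
                           int gap_open, int gap_extend) {
    ParadigmConfig cfg;
    cfg.keep_zero = smith_waterman;        // SW keeps the optional 0, NW drops it
    cfg.theta     = affine ? gap_open : 0; // linear gaps pay no opening cost
    cfg.beta      = gap_extend;            // both systems pay a per-position cost
    return cfg;
}
```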

5.4 AAlign Framework

The AAlign framework adopts "striped-iterate" and "striped-scan" as the basic vectorization strategies. We make a few modifications to the original methods derived from [62] and [87] to fit our framework. Figure 5.3 illustrates the overview of AAlign. The framework can accept any sequential code following our generalized paradigm in § 5.3. After analyzing the Abstract

Syntax Tree (AST) of the sequential code, AAlign obtains the required information, such as the type of the given alignment algorithm and the selected gap penalty system. Then, AAlign feeds the information into the "vec code constructs", which are formalized according to the aforementioned vectorizing strategies. Finally, the framework generates real codes by using the proper vector modules. These modules include primitive vector operations whose implementations are ISA-specific.


Figure 5.3: High-level overview of the structure of AAlign framework.

5.4.1 Vector Code Constructs

In this section, we will first describe the SIMD-friendly data layout used in AAlign. Based on it, we will present two vector code constructs containing the vector modules (§ 5.4.3) and the configurable parameters (§ 5.4.4).

Original: i a b c d e A B C D E

Striped: i a A b B c C d D e E

v1 v2 v3 v4 v5

Figure 5.4: The original and SIMD-friendly striped layouts.

Striped layout: AAlign always conducts the tabular computation along the query sequence

Qm. After loading the data from the same column in Figure 5.1 into the buffer, AAlign transforms the data layout into the striped format, which is SIMD-friendly because the data dependencies among adjacent elements are eliminated. Figure 5.4 shows the data layouts before and after the striped transformation. In the original buffer, we have 20 elements from the same column of the table, and each element depends on its preceding neighbor (the vertical direction in Figure 5.1). If we load the elements directly into five vectors, the data dependencies will hinder efficient vector operations. By rearranging the buffer into the striped format, dependent elements are distributed to different vectors, making the interaction happen among vectors rather than within vectors.

Algorithm 5: Vector code constructs for striped-iterate
/* m is the aligned length of Q, n is the length of S, k is the number of vectors in Q, equal to m/veclen. If the linear gap penalty system is taken, AAlign will ignore the asterisked statements */
1 vec vTdia, vTleft, vTup, vT;
2 vec vTmax = broadcast(INT_MIN);
3 vec vGapTleft = broadcast(GAP_LEFT);
4 vec vGapTup = broadcast(GAP_UP);
5 *vec vL, vU;
6 *vec vGapL = broadcast(GAP_LEFT_EXT);
7 *vec vGapU = broadcast(GAP_UP_EXT);
8 *vec vZero = broadcast(0);
9 for i ← 0; i ...
... = k then
39 REC_UP = rshift_x_fill(REC_UP, 1, REC_FILL);
40 j = 0;
41 vT = load_vector(arrT2 + j ∗ veclen);
42 swap(arrT1, arrT2);

Striped-iterate: This vectorizing strategy is based on [62]. The modified vector code constructs are shown in Algorithm 5. We use two m-element buffers arrT1 and arrT2 to store the best-matching scores. Additionally, an m-element buffer arrL stores the scores denoting best-matching with an ending gap in Q. The scores denoting best-matching with an ending gap in S are stored in the

vector register vTup, or vU if the affine gap penalty system is taken. In this strategy, we first partially ignore the data dependencies within the buffer (along Q) and use the predefined vectors (line 11 and line 13) to set lower bounds. In the predefined vectors (vTup or vU), only the first elements come from the real initialization expressions (INIT_T and INIT_U), while the other elements are derived from them and the corresponding gap penalties (GAP_UP and GAP_UP_EXT). As a result, the first round of preliminary computations (line 16 to line 29) only ensures that the first elements in each vector are correct (the a-e cells in Figure 5.4). We need to correct the results if the updated predefined vectors affect the results (line 33).

The re-computations for correction (line 34 to line 41) take at most veclen−1 rounds to ensure that all the other elements in the vectors are correct. After that, we continue the for loop (line 9) to process the next character in S, which corresponds to another column in Figure 5.1.
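The striped transformation of Figure 5.4 itself is just an index remapping. A hedged scalar sketch of that mapping (our own helper, not an AAlign module) is:

```cpp
// Hedged sketch: rearrange a column buffer of m = k * veclen elements into the
// striped layout of Figure 5.4. Vector v (0 <= v < k) holds the original
// elements v, v + k, v + 2k, ..., so dependent neighbors land in different vectors.
#include <vector>

std::vector<int> to_striped(const std::vector<int>& original, int veclen) {
    const int m = static_cast<int>(original.size());
    const int k = m / veclen;                      // number of vectors (assumes m % veclen == 0)
    std::vector<int> striped(m);
    for (int v = 0; v < k; ++v)                    // which vector
        for (int lane = 0; lane < veclen; ++lane)  // which lane inside the vector
            striped[v * veclen + lane] = original[lane * k + v];
    return striped;
}
```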

Algorithm 6: Vector code constructs for striped-scan
// m is the aligned length of Q, n is the length of S, k is the number of vectors in Q, equal to m/veclen. If the linear gap penalty system is taken, AAlign will ignore the asterisked statements
1 vec vTdia, vTleft, vTup, vT;
2 vec vTmax = broadcast(INT_MIN);
3 vec vGapTleft = broadcast(GAP_LEFT);
4 *vec vL;
5 *vec vGapL = broadcast(GAP_LEFT_EXT);
6 *vec vZero = broadcast(0);
7 for i ← 0; i ...

Striped-scan: The scan strategy in AAlign is based on the GPU method [87]. We modify it by using the striped format on x86-based platforms, as shown in Algorithm 6. Similar to the

striped-iterate, we define three m-element buffers arrT1, arrT2, and arrL. In addition, an extra

buffer arrscan is used to store the scan results. In this strategy, we first completely ignore the data dependencies within the buffer (along the Q) to do the tentative computation (line 9 to line 17).

Unlike the striped-iterate, we conduct “weighted” scan over the tentative results arrT2 and store the scan results to arrscan (line 18). Finally, we use the values in arrscan to correct the results (line 19 to line 24). After that, we continue to process the next character in S (line 7).

5.4.2 Hybrid Method

As we discussed in § 5.2.2, no single combination of the algorithms (SW or NW), vectorizing strategies (iterate or scan), and gap penalty systems (linear or affine) can always provide optimal performance for different pairs of input sequences. Before we provide a better solution, we investigate under what circumstances a specific combination wins. We test various query sequences, whose lengths range from 100 to 36k characters. We fix the algorithm to SW and the gap penalty system to the affine gap, and change the vectorizing strategies. We find that the striped-scan strategy performs better when the number of re-computations in striped-iterate exceeds around 1.5 on MIC and 2.5 on Haswell (for other combinations of algorithms and gap systems, the ratios are similar due to the similar computational patterns and workloads). Generally, if the best-matching score before the re-computations is high, meaning that the two input sequences may be close to each other, the striped-iterate has to carefully and iteratively check each position with more re-computation steps in order to eliminate false negatives; in striped-scan, in contrast, no matter what the matching scores are, a fixed number of re-computations is needed. Paradoxically, we cannot rely on this observation to determine which strategy should be taken, because unless we finish the alignment algorithm and get the real matching scores, we do not know how similar or dissimilar the input pair of sequences is, or even a specific range of the pair.

In this chapter, we propose an input-agnostic hybrid method that can automatically select the efficient vectorizing strategy at runtime. Our hybrid method starts from the striped-iterate strategy, in which we maintain a counter to record the number of re-computations. When the counter exceeds the configured threshold, the method switches to the striped-scan. For example, based on the experiments for the combination of SW with the affine gap presented in the previous paragraph, we set the threshold to 2 for MIC and 3 for the Haswell CPU. However, switching back from striped-scan to striped-iterate is nontrivial, because we do not know the amount of re-computation that striped-iterate would need while the algorithm is working in the striped-scan mode. Alternatively, we design a solution to "probe" the re-computation overhead at a configurable interval stride. That way, after processing stride characters of the subject sequence using the striped-scan, we tentatively switch back to the striped-iterate and rely on the counter to determine the next switch. Once the counter is above the threshold, we switch back again to the striped-scan for another round of processing stride characters. Otherwise, our method stays in the striped-iterate mode and continues checking the counter.

Figure 5.5 shows an example of the hybrid method. If we only rely on the striped-iterate method, the re-computations in the middle part of the subject sequence will kill the performance due to their overhead. In contrast, if we only use the striped-scan, the benefits of the striped-iterate on the head and tail parts will be wasted. Our hybrid method uses the counter to find that the amount of re-computation is above the threshold around the 800-th character, and thus switches to the striped-scan method. Then, it probes the counter periodically by going back to the iterate method until the counter drops below the threshold or the end of the sequence S is reached.


Figure 5.5: The mechanism of the hybrid method.

One may wonder why the hybrid mechanism starts from the striped-iterate, conservatively switches from striped-iterate to striped-scan only when the counter exceeds the threshold, and aggressively switches back by using the proactive probe. The reason is related to the characteristics of sequence search: although sequence alignment is designed to find similar sequences in databases for the input query, it cannot identify too many similar sequences, because statistically most of the sequences in a database are dissimilar to a specific input. Even if a sequence is determined to be similar to the input, their exactly matched regions are few. Considering the much faster convergence speed of striped-iterate for dissimilar pairs, we prefer it, and conservatively switch to striped-scan only when we find that the currently aligned regions are highly matched.
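The control flow of the hybrid method can be summarized by the hedged sketch below; threshold and stride correspond to the configured threshold and probing interval described above, while the column kernels are placeholders passed in by the caller.

```cpp
// Hedged sketch of the hybrid control loop: start in striped-iterate, switch to
// striped-scan when the re-computation count exceeds `threshold`, and probe
// the iterate kernel again every `stride` subject characters.
#include <cstddef>
#include <functional>

// iterate(j) processes column j with striped-iterate and returns the number of
// correction rounds it needed; scan(j) processes column j with striped-scan.
void align_hybrid(std::size_t n, int threshold, std::size_t stride,
                  const std::function<int(std::size_t)>& iterate,
                  const std::function<void(std::size_t)>& scan) {
    std::size_t j = 0;
    while (j < n) {
        // Striped-iterate mode: stay here while corrections remain cheap.
        int recomputations = iterate(j++);
        if (recomputations <= threshold)
            continue;
        // Striped-scan mode: run for `stride` columns, then probe iterate again.
        for (std::size_t done = 0; done < stride && j < n; ++done)
            scan(j++);
    }
}
```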

5.4.3 Vector Modules

We have already seen the usage of the vector modules in Algorithm 5 and Algorithm 6. These vector modules are designed to express the required primitive vector operations in our vector code constructs and to hide the ISA-specific vector instruction details. Therefore, when the platform changes, AAlign only needs to re-link the vector code to the proper set of vector modules. Table 5.1 defines the vector primitive modules. The first group of modules is designed to conduct basic vector operations over given arrays or vectors. Specifically, they are wrapper functions of directly mapped ISA intrinsics. In contrast, the second group of modules carries out application-specific operations, customized to our formalized vector code constructs.

Table 5.1: The vector modules in AAlign

Basic Vector Operation API
load_vector(void *ad); store_vector(void *ad, vec v) | Load/store a vector from/to the memory address ad, which can be char*, short*, or int* (the same below)
add_vector(vec va, vec v); add_array(void *ad, vec v) | Add the vector va, or a vector loaded from the memory address ad, by the vector v
max_vector(vec v1, ...) | Take any count of input vectors, and return the vector with the largest integer in each aligned position

App-specific Vector Operation API
set_vector(int m, int i, int g) | Initialize a new vector, in which i is the default Ti,j or Fi,j value when j=0 and g is the corresponding gap βi,j or θi,j
rshift_x_fill(vec v, int n, ...); rshift_x_fill(void *ad, int n, ...) | Right shift the vector v, or the vector loaded from ad, by n positions and fill the gaps with specified values
influence_test(vec va, vec vb) | Check if vector va can affect the values in vb
wgt_max_scan(void *in, void *out, int m, int i, int g, int G) | "Weighted" max-scan over the values in in (in the striped format), storing the results to out; i is the default Ti,j value when j=0, and g, G are the corresponding βi,j and θi,j

set_vector: sets the lower-bound vector in the striped-iterate strategy. Figure 5.6 shows that AAlign sets the first value of the lower-bound vector to the initial value i. Then, the lower-bound values of the rest are set to i + l ∗ k ∗ g, where l is the element's index, k is the

total number of vectors, and g is the associated gap penalty. The module is implemented with the proper _mm256/_mm512 set intrinsics.


Figure 5.6: Vector modules used in the striped-iterate.

rshift_x_fill: right shifts the vector elements, filling with the value x. AAlign uses this module to adjust the data dependencies between vectors. As shown in Figure 5.6, the 1st round of computation can only ensure that the values in the first column (the a-e cells) are correct, since they are calculated based on the real initial value i. Therefore, a test of the need for correction is required. Before that, we observe that in the 2nd round, the current "true" value e would affect A according to the original layout in Figure 5.4. As a result, we shift the vector v5 right by 1 position and fill the gap with a sufficiently small number x to make sure it causes no influence. The implementation is essentially a combination of data-reordering operations. However, the selection of instructions is quite different because of the different ISAs and desired data types. Figure 5.7 shows how to achieve the same functionality with different intrinsics. Because the shortest integer data type supported by IMCI is 32-bit, we only show IMCI with 32-bit int, which uses a combination of the cross-128-bit-lane permutevar and swizzle intrinsics. In contrast, on AVX2 with 32-bit int, we directly insert the value x after the permutevar completes. If we work on 16-bit values, there is no equivalent permutevar intrinsic, so we use the shufflehi/lo, permute8x32, and blend intrinsics for this functionality, followed by the insert.

influence_test: checks whether an extra re-computation for correction is necessary in the striped-iterate method. Specifically, the module is a vector comparison. A comparison result containing 1s means the 1st operand will affect the 2nd one. In IMCI, the results are stored in a 16-bit mask, and we simply check whether this value is larger than 0. However, in AVX2, the "mask" is stored in a 256-bit vector, and there is no single instruction to peek how many set bits are inside. Our solution is to split the vector into two 128-bit SSE vectors and use the intrinsic _mm_test_all_zeros to detect whether there are set bits.
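For instance, the AVX2 32-bit case in Figure 5.7 can be realized with two intrinsics. The stand-alone wrapper below is a hedged illustration (the function name and lane convention are ours), assuming the "right shift" rotates each lane to the next-higher position and places the fill value x in lane 0.

```cpp
// Hedged sketch: one possible AVX2 (32-bit int) realization of rshift_x_fill,
// mirroring the permutevar + insert sequence of Figure 5.7.
#include <immintrin.h>

static inline __m256i rshift_x_fill_avx2_epi32(__m256i v, int x) {
    // Move lane i to lane i+1 (a cross-lane rotation); lane 0 temporarily gets lane 7.
    const __m256i idx = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    __m256i shifted = _mm256_permutevar8x32_epi32(v, idx);
    // Fill the vacated lane 0 with x (e.g., a sufficiently small sentinel value).
    return _mm256_insert_epi32(shifted, x, 0);
}
```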

rshift_x_fill (IMCI, 32-bit int): __m512_permutevar_epi32, __m512_set1_epi32, __m512_mask_swizzle_epi32
rshift_x_fill (AVX2, 32-bit int): __m256_permutevar_epi32, __m256_insert_epi32
rshift_x_fill (AVX2, 16-bit int): __m256_shufflehi/lo_epi16, __m256_permutevar8x32_epi16, __m256_blend_epi16, __m256_insert_epi16

Figure 5.7: Example of chosen ISA intrinsics for rshift_x_fill (only blend operations are shown with arrows).

wgt_max_scan: implements the "weighted" scan along the buffer holding the tentative results (denoted as T̃i,j and stored in arrT1, from line 18 of Algorithm 6). Mathematically, we perform the calculation of $\max_{0 \le l < j}\big(\tilde{T}_{i,l} + \theta_{i,l} + \sum_{k=l+1}^{j}\beta_{i,k}\big)$. For simplicity, we assume θi,l and

βi,k are two constants θ and β, and we only use 8 characters as the example for the striped sequence shown in Figure 5.4. In Figure 5.8, we use three steps to achieve the wgt_max_scan. First, we conduct a preliminary round of inter-vector weighted scan on v1 and v2 with initial weight θ + β and extension weight β. The results are stored in the intermediate vectors u1 and u2. Second, an intra-vector, exclusive weighted scan is performed on vector u2 with the weight k ∗ β, where k is the total number of vectors. The results are stored in s. Third, a last round of inter-vector, exclusive weighted broadcast is performed on s, u1, and u2 with the weight β. The

final scan results are stored in arrT1.
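To clarify what the three vectorized steps compute, here is a hedged scalar reference of the same "weighted" max-scan recurrence, written over the buffer in its original (non-striped) order; the function name and the assumption of constant θ and β are ours.

```cpp
// Hedged sketch: scalar reference of the weighted max-scan with constant weights,
// scan[j] = max( init + theta + (j+1)*beta, max_{0<=l<j}( t[l] + theta + (j-l)*beta ) ),
// computed incrementally in O(n). The vectorized wgt_max_scan produces the same
// values via the three inter-/intra-vector steps of Figure 5.8.
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<int> wgt_max_scan_ref(const std::vector<int>& t, int init, int theta, int beta) {
    std::vector<int> scan(t.size());
    int running = init + theta;                     // boundary contribution before any extension
    for (std::size_t j = 0; j < t.size(); ++j) {
        running += beta;                            // every candidate gains one more extension
        scan[j] = running;
        running = std::max(running, t[j] + theta);  // t[j] becomes a candidate for position j+1
    }
    return scan;
}
```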


Figure 5.8: Orchestration mechanism in the wgt_max_scan (maximum operations are applied on each cell).

5.4.4 Code Translation

The AAlign framework takes sequential codes following our generalized paradigm as the input. After analyzing the codes, the framework decides how to modify the vector code constructs. We make use of the Clang driver [3] to create the Abstract Syntax Tree (AST) for both the sequential codes and the vector code constructs, as shown in Figure 5.3. To traverse the trees, we build our Matcher and Visitor classes in Clang's AST Consumer class. Once the information from the AST nodes of interest is identified and retrieved, we use our Rewriter class to modify the AST of the vector code constructs with the information and its derivative results. Note that the present framework only supports constant gap penalties (e.g., βi,k, θi,l). We leave support for the variable penalties used in, for example, the dynamic time warping (DTW) algorithm to future work.

Table 5.2 shows the configurable expressions in Algorithm 5 and Algorithm 6. The information can be retrieved from the sequential codes in four groups: 1. Identify which type of pairwise alignment algorithm is used, local or global; this can be done by checking whether a constant 0 is assigned to T. 2. Identify what kind of gap penalty system is used; we can check whether θ is set to 0 (rows 1-4 in Table 5.2). 3. Learn how to initialize the boundary values (rows 5-6). 4. Derive other information about how to organize the vectors (rows 7-11). After the vector code constructs have been rewritten, we use the hybrid method to generate our pairwise sequence alignment kernels.

Table 5.2: Configurable expressions in vector code contructs

Expression | Description & Format | Example*
GAP_LEFT | Gap penalty from the left T cell (i.e., θ + β); constants or variables | GAPOPEN (line 7)
GAP_UP | Gap penalty from the upper T cell (i.e., θ + β); constants or variables | GAPOPEN (line 8)
GAP_LEFT_EXT | Gap penalty from the left L cell (i.e., β); constants or variables | GAPEXT (line 7)
GAP_UP_EXT | Gap penalty from the upper U cell (i.e., β); constants or variables | GAPEXT (line 8)
INIT_T | Upper boundary value of the T cell; func(i) | 0 (line 2)
INIT_U | Upper boundary value of the U cell; func(i) | 0 (line 2)
MAX_OPRD | Operands required by the max operation; vec variables | vU, vL, vZero
REC_FILL | Value to fill the right-shifted gap; constant | GAPOPEN (line 8)
REC_UP | Operand for checking the re-computation; vec variable | vU
REC_UP_GAP | Gap operand for REC_UP; vec variable | vGapU
REC_CRIT | Criterion for checking re-computation; vec variable | vGapTup - vGapU
*: The examples are fetched or derived from Algorithm 4

5.4.5 Multi-threaded version

The AAlign framework can also utilize the thread-level parallelism of multi- and many-core processors to align a given query with all subject sequences in a database. We assign the generated kernel to each thread, and a thread grabs a subject sequence from the database and conducts the alignment until all subject sequences are aligned. After we sort the database by subject sequence length, this dynamic binding mechanism is extremely efficient because of the load balance among threads. For the implementation, we do not need to create the query's profile array of the substitution matrix every time (prof in line 17 of Algorithm 5 or line 10 of Algorithm 6). Therefore, the only change to the kernel is to extract the part that builds the profile array and perform it once before launching multiple threads.
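A hedged sketch of this dynamic binding, using an atomic work counter over the length-sorted database (names are illustrative), is shown below.

```cpp
// Hedged sketch: dynamic binding of subject sequences to threads. The database
// is assumed to be sorted by subject length; each thread atomically grabs the
// next unaligned subject until the database is exhausted. align_one stands in
// for the AAlign-generated kernel (the query profile is built once beforehand).
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

void align_database(const std::vector<std::string>& subjects,
                    const std::function<void(const std::string&)>& align_one,
                    unsigned num_threads) {
    std::atomic<std::size_t> next(0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&]() {
            for (std::size_t i = next.fetch_add(1); i < subjects.size();
                 i = next.fetch_add(1)) {
                align_one(subjects[i]);   // one alignment per dynamically claimed subject
            }
        });
    }
    for (auto& th : pool) th.join();
}
```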

5.5 Evaluation

In this section, we evaluate the AAlign-generated pairwise sequence alignment codes on a Haswell CPU and a Knights Corner MIC. For Haswell, we use two sockets of E5-2680 v3, which together contain

24 cores running at 2.5 GHz with 128 GB DDR3 memory. Each core has 32 KB L1 and 256 KB L2 cache, and shares a 30 MB L3 cache. For MIC, we use the Intel Xeon Phi 5110P coprocessor in native mode. The coprocessor consists of 60 cores running at 1.05 GHz with 8 GB GDDR5 memory, and each core includes 32 KB L1 and 512 KB L2 cache. We use icpc from Intel compiler 15.3 with the -O3 option to compile the codes. To target the desired vector ISA, we also include -xCORE-AVX2 for the CPU and -mmic for the MIC. All the sequences are from the NCBI protein database [4]. The number of characters is integrated into the query name. Our objectives include: (1) compare the AAlign-generated codes with the optimized sequential codes; (2) compare the proposed hybrid method with the iterate and scan methods, respectively; (3) compare multi-threaded versions of the AAlign-generated codes with existing tools.

5.5.1 Speedups from Our Framework

We first compare the AAlign-generated codes (32-bit int) with the sequential codes (32-bit int) to evaluate the vectorization efficiency. The subject sequence is Q282. The sequential codes follow the same logic as the vector codes. We also add "#pragma vector always" to the inner loop of the codes. The speedups, shown in Figure 5.9, are the performance benefits brought by AAlign using striped-iterate and striped-scan, respectively. By using the striped-scan, SW and NW can achieve an average of 4.8 and 13.6-fold speedups over the sequential codes on CPU and MIC, respectively. In contrast, the speedups of the striped-iterate SW and NW vary in a wider range of 4.7 to 10-fold on CPU and 9.5 to 25.9-fold on MIC. The superlinear speedups of the striped-iterate are mainly because the striped-iterate avoids a considerable amount of computation along Q if the influence test fails.

We can see that the performance variance of the striped-scan is smaller than that of the striped-iterate. For example, although SW approximates NW in terms of computational workload, the performance of the striped-iterate SW-affine (Figure 5.9c) and NW-affine (Figure 5.9d) changes a lot, while the striped-scan remains relatively consistent. Actually, the performance difference between the two methods depends on the processed numerical values, which are affected by the algorithms, gap systems, and input sequences.

                         

 

 

 

     

 

(a) SW (CPU)  (b) NW (CPU)  (c) SW (MIC)  (d) NW (MIC)

Figure 5.9: AAlign codes vs. baseline sequential codes. The baselines are different, and they are optimized to follow logic similar to the corresponding AAlign codes.

5.5.2 Performance for Pairwise Alignment

In the preceding section, we observed that the algorithm and gap penalty system affect the choice of the better vectorizing strategy. This section changes the input sequences. We first borrow the concepts of query coverage (QC) and max identity (MI) [1] from the bioinformatics community to describe the similarity of the input sequences. QC is the percentage of the query sequence Q overlapping the subject S, while MI is the percentage of similarity between Q and S over the length of the overlapped area. Additionally, we define three ranges: hi (>70%), md (70%-30%), and lo (<30%). That way, we have 9 combinations of QC_MI to represent the similarity and dissimilarity of two input sequences. For example, lo_hi means only a small portion of the two sequences overlaps, but the overlapped areas are highly similar. In the experiment, we use Q2000 against the "nr" database using NCBI-BLAST [1] and pick out 9 typical subjects for the aforementioned criteria.

Figure 5.10 shows the performance of AAlign using different vectorizing strategies, including striped-iterate, striped-scan, and hybrid, on CPU and MIC.

(a) SW w/ linear gap (CPU)  (b) SW w/ affine gap (CPU)  (c) NW w/ linear gap (CPU)  (d) NW w/ affine gap (CPU)  (e) SW w/ linear gap (MIC)  (f) SW w/ affine gap (MIC)  (g) NW w/ linear gap (MIC)  (h) NW w/ affine gap (MIC)

Figure 5.10: AAlign codes using striped-iterate, striped-scan, and the hybrid method. The x-axis represents the similarity of the two sequences using the format QC_MI, in which the query coverage (QC) and max identity (MI) metrics are in three levels: high (>70%), medium (70%-30%), and low (<30%).

For the alignment algorithms with a linear gap penalty, the striped-iterate method always outperforms the striped-scan, because the effect of the zero θ causes the number of re-computations to be very small. The results also show that with the linear gap penalty, our hybrid method falls back to the striped-iterate and has very similar performance to it. For the algorithms with an affine gap penalty, the striped-scan is better than the striped-iterate when the two sequences have high or medium scores of QC and MI, meaning that the input sequences are very similar. For example, for the sequences labeled as hi_hi, hi_md, md_hi, and md_md in Figures 5.10b, 5.10d, 5.10f, and 5.10h, the striped-scan is the

better solution, thanks to its fixed rounds of re-computation. In the cases of NW with the affine gap, the striped-scan can deliver up to a 3.5-fold speedup on MIC and up to a 1.9-fold speedup on CPU over the striped-iterate. For other inputs (dissimilar input sequences), the striped-iterate is better. Because the hybrid method can automatically switch to the better solution, in most test cases the hybrid method has better performance than either the striped-iterate or the striped-scan method. In the corner cases, the hybrid method approximates the better solution instead of the worse one.

5.5.3 Performance for Multi-threaded Codes

In this section, we compare AAlign's multi-threaded SW with the affine gap penalty system against the tools SWPS3 and SWAPHI. The database is "swiss-prot", containing more than 570k sequences [6]. SWPS3 [142] uses a modified version of the striped-iterate method working on CPUs. The buffers of the table T are of char and short data types. SWAPHI [105] supports both inter-sequence and intra-sequence vectorization in the multi-threaded mode on MIC. In the experiment, we only focus on its intra-sequence method with the int data type. Correspondingly, we use our generated kernels of short and int data types on CPU and MIC, respectively.

            

 

                     (a) vs. SWPS3 on CPU (b) vs. SWAPHI on MIC

Figure 5.11: AAlign Smith-Waterman w/ affine gap vs. existing highly-optimized tools.

Figure 5.11 presents the results of the AAlign SW algorithms compared with the two highly optimized tools. On the CPU, the generated AAlign codes outperform SWPS3 by up to 2.5 times, especially for short query sequences. However, in Figure 5.11a, for the long sequence Q4000, SWPS3 is better. This is mainly because, rather than working entirely on the short data type (16 bits), SWPS3 also uses char-type (8-bit) buffers. Only when overflow occurs does the tool switch to short. This is especially beneficial for long query sequences by lowering the cache pressure. For the MIC, we outperform SWAPHI by an average of 1.6 times, thanks to our hybrid method and the efficient vector modules.

5.6 Chapter Summary

The AAlign framework can generate vector codes based on "striped-iterate" and "striped-scan". Moreover, we design an input-agnostic hybrid method, which can take advantage of both vectorization strategies. The generated codes are linked to a set of platform-specific vector modules. To do this, AAlign only needs the input sequential codes to follow our generalized paradigm. The results show that the vector codes can deliver considerable performance gains over their sequential counterparts by utilizing data-level parallelism and decreasing the amount of computation. We also demonstrate that our hybrid method is able to automatically switch to the better vectorization strategy at runtime. Finally, compared to existing highly optimized multi-threaded tools, the multi-threaded AAlign codes can also achieve competitive performance.

Chapter 6

Data Dependencies in Wavefront Loops

6.1 Introduction

Modern accelerators, e.g., GPUs, feature wide vector-like compute units and a complex memory hierarchy. If parallel applications can be organized to follow the SIMD processing paradigm, coalesced memory access patterns, and data reuse at different levels of the memory hierarchy, GPUs can often deliver superior performance over CPUs. However, wavefront loops, which can be found in many scientific applications, including partial differential equation (PDE) solvers and sequence alignment tools, are exceptions. Because their computations (including the accumulation operator and the distribution operator, discussed in § 6.2.1) update each entry of a two-dimensional (2-D) matrix based on the already-updated values from its upper, left, and (optional) diagonal neighbors, this strong data dependency hinders optimizing computation and memory access on GPUs at the same time. That is, if data is stored in row- or column-major order, the data can be processed in parallel but from non-contiguous memory addresses. In other words, the data dependencies prevent consecutively stored data from being processed in parallel. Alternatively, if the data is stored in anti-diagonal-major order, parallel computation can naturally follow the data dependency, but the exposed parallelism may result in severe load imbalance and underutilization of the compute units.

Figure 6.1 provides an overview of existing approaches to optimize wavefront loops on


GPUs. In ①, the parallelism on anti-diagonal data is directly exposed by applying loop transformation techniques, e.g., loop skewing and loop interchange [20, 156]. These methods may lead to severe performance penalties due to non-contiguous memory access and load imbalance. In ②, the tiling-based methods [110, 59, 24] block the reusable data in local memories, e.g., caches, to reduce the overhead of uncoalesced memory access. However, as their computation still strictly follows the anti-diagonal order, load imbalance occurs at the beginning and ending anti-diagonals, causing underutilization of the computing resources. Although the recent tiling-based method called PeerWave [24] can mitigate the problem, it incurs extra overhead to transform the data layout. In contrast, instead of mainly optimizing memory accesses on GPUs, the recent studies [62, 87] in ③ focus on accelerating computations and resolving the load imbalance. The core idea is to ignore the data dependency along a row at first, compute the data entries in a row in parallel, and finally correct the intermediate results. We call this method compensation-based parallelism for wavefront loops. However, this method requires expensive global synchronizations within and between the processing of each row, leading to frequent global synchronizations and the loss of data reuse. More importantly, because this method does not follow the original data dependency and has changed the sequence of computation operators on data entries, domain knowledge from developers is required to guarantee the correctness of the final results, which makes it a privilege for experienced users only.

Memory access optimization vs. computation optimization: ① Direct Parallelism (loop transformation); ② Locality (tiling); ③ Computation Refactoring (compensation, domain-specific solution); ④ [this work]

Figure 6.1: Parallelization landscape for wavefront loops.

In this chapter, we first investigate under which circumstances the compensation-based method, which breaks through the data dependency and properly changes the sequence of computation operators, can be used to optimize wavefront loops. We prove that if the accumulation operator is associative and commutative and the distribution operator is either distributive over or the same as the accumulation operator, changing the sequence of operators properly does not affect the correctness of the results. We also show that several popular algorithms, including a successive over-relaxation (SOR) solver [58], the Smith-Waterman (SW) algorithm [140], summed-area table (SAT) computation [118], and the integral histogram (IHist) algorithm [125], satisfy these requirements. Due to its generality, we design a highly efficient compensation-based solution for wavefront loops on GPUs: we propose a weighted scan-based method to accelerate the computation and combine it with the tiling method to optimize memory access and reduce global synchronization overhead. In the evaluation, we first compare the performance of the weighted scan-based GPU kernels with those based on widely used libraries, i.e., Thrust [71] and ModernGPU [22]; our kernels deliver an average of 3.5x and 4.7x speedups on NVIDIA K80 and P100 GPUs, respectively. We also use our methods to optimize the SOR, SW, SAT, and IHist application kernels, yielding up to 22.1x (43.3x) speedups on K80 (P100) over state-of-the-art optimizations [24]. Even for their best scenarios, we can still obtain an average of 1.1x (1.8x) improvement on K80 (P100). The key contributions of this chapter are summarized below.

• We prove that in wavefront loops, if the accumulation operator is associative and commutative and the distribution operator is either distributive over or the same as the accumulation operator, then breaking through the data dependency and properly changing the sequence of computation operators does not affect the correctness of the results. This provides guidance for developers on the circumstances under which the compensation-based method can be used. (In ③ and ④ of Figure 6.1.)

• We design a highly efficient compensation-based method on GPUs. Our method provides weighted scan-based GPU kernels to optimize the computation, and combines them with the tiling method to optimize memory access and synchronization. (In ④ of Figure 6.1.)

• We carry out a comprehensive evaluation at both the kernel level and the application level to demonstrate the efficiency of our method over the state-of-the-art research for wavefront loops.

6.2 Motivation

6.2.1 Wavefront Loops and Direct Parallelism

When loop-carried dependencies are present, compilers often struggle to parallelize the loops effectively, even with auto-vectorization technologies and user-provided directives [108]. Algorithm 6.1 shows an example of the original loop nests with such data dependencies. The corresponding iteration and memory spaces are shown in Figure 6.2a. This loop can be parallelized over neither the i-loop nor the j-loop if we keep the row-major memory access pattern¹. Fortunately, the parallelism-inhibiting dependencies can be 'eliminated' by applying loop transformation techniques, e.g., loop skewing and loop interchange [156, 20]. The transformed loop is also shown in Algorithm 6.1. Thereafter, the potential parallelism can be exposed from the iteration space in Figure 6.2b. However, this approach has two significant drawbacks: (a) load imbalance, especially at the beginning and the ending iterations (shown in the iteration space); and (b) non-contiguous memory access (shown in the memory space).

Algorithm 6.1 Original & transformed loop nests with wavefront parallelism.

1 // Original 2 for(int i=0;i

Algorithm 6.1 also shows that there are two types of binary operators in wavefront loops. The distribution operator distributes a part of the value of one entry to another; in this example, the multiplication "∗" is the distribution operator, which distributes a portion of A[i][j-1] and A[i-1][j] to A[i][j]. The other operator is the accumulation operator, which accumulates incoming values into an entry; here, the accumulation operator is "+". Most existing studies strictly follow the sequence of operators, which means an entry is updated by the accumulation operator only after receiving the distributed values from all of its prerequisites.
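Since the listing of Algorithm 6.1 is abbreviated above, the hedged sketch below shows the kind of loop nest it describes together with its skewed and interchanged form. The loop body, with "∗" as the distribution operator and "+" as the accumulation operator, and the constants b0/b1 are illustrative; they are not the exact statements of Algorithm 6.1.

```cpp
// Hedged sketch of a wavefront loop nest and its loop-skewed/interchanged form.
// A is an m-by-n row-major matrix whose row 0 and column 0 are predefined;
// b0 and b1 are illustrative constants.
#include <algorithm>

void wavefront_original(float* A, int m, int n, float b0, float b1) {
    for (int i = 1; i < m; i++)
        for (int j = 1; j < n; j++)   // depends on A[i][j-1] and A[i-1][j]
            A[i * n + j] = A[i * n + (j - 1)] * b0 + A[(i - 1) * n + j] * b1;
}

void wavefront_skewed(float* A, int m, int n, float b0, float b1) {
    // After skewing, all iterations on one anti-diagonal d are independent and
    // could run in parallel, but they touch non-contiguous memory addresses.
    for (int d = 2; d <= m + n - 2; d++)
        for (int i = std::max(1, d - n + 1); i <= std::min(m - 1, d - 1); i++) {
            int j = d - i;
            A[i * n + j] = A[i * n + (j - 1)] * b0 + A[(i - 1) * n + j] * b1;
        }
}
```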

¹By default, we assume the row-major layout for all arrays.

(a) Original  (b) Transformed

Figure 6.2: Exposed parallelism and corresponding memory access pattern of the two forms of loop nests in Algorithm 6.1. In the iteration space, the arrow represents the data dependency, e.g., a←b means iteration b depends on a.

6.2.2 Tiling-based Solutions and Their Limitations

To reduce the cost of load imbalance and amortize the overhead of non-contiguous memory access, many studies [24, 152, 110] apply the tiling-based methods, where the spatial locality can be improved and the expensive synchronization among each entry will convert to the synchronization among tiles. However, there are two other issues. Data layouts: In a basic design using the tiling optimizations, one can divide the working set into tiles and follow the anti-diagonal direction to parallelize the computation. The overhead of accessing non-contiguous data is mitigated by the cache. However, the non-contiguous data access still exists inside each tile, and at the beginning and the ending of anti diagonals, there are no enough entries that can be executed in parallel. This motivates the anti-diagonal major storage and hyperplane on GPUs [59, 24]. However, there are two another problems emerging: (1) Wasted memory and computing resources. Suppose the dimensions of the working matrix A are m by n and it is divided into hyperplane tiles of h by w, shown in Figure 6.3a. To store the m · n+h−1 + − 1 array A, we need to allocate h w hyperplane tiles, where n h is to process n row entries plus the longest preceding padding entries that is equal to h − 1. As a result, the actual memory usage for all hyperplane tiles must be larger than m · (n + h). Therefore, in the hyperplane mode, the percentage of effective memory usage is approximately n/(n + h). Apparently, the padding overhead is not negligible when n + h is sufficiently larger than n. One might wonder that the padding could be removed by adding rules to skip out-of-bound access. However, this strategy will break the uniform access that is preferred by GPUs. Figure 6.3b 104

(a) Hyperplane tiles  (b) Access patterns

Figure 6.3: Splitting the array A into hyperplane tiles and their access patterns w/ and w/o padding.

Figure 6.3b demonstrates the diverged access in the padding-free scenario. Compared to the uniform pattern, each highlighted element needs to use a different indexing formula to obtain its north neighbor, e.g., pos − 2 and pos − 3 in the figure. Therefore, the padding-free strategy may greatly increase the complexity of indexing and lead to more branches in GPU kernels, resulting in performance degradation. (2) Layout transformation overhead. To remove the non-contiguous memory access, the data layout can be transformed from row-major to anti-diagonal-major [33, 24]. However, this conversion not only requires developers to refactor their implementations, but also causes significant transformation overhead. In the evaluation, we have observed that the transformation time makes up 31~60% and 40~72% of the computation time on NVIDIA K80 and P100 GPUs, respectively (§ 6.6.2).

Task scheduling: The tile-based solutions, e.g., [110, 24, 160], assign a complete row of tiles to one compute unit, e.g., a multiprocessor (MP) of a GPU. In that way, the sequential execution order within an MP naturally satisfies the dependencies between tiles on the same row, while the dependencies between tiles on different rows can be satisfied via lightweight local synchronizations, e.g., spin-locks, leading to a pipeline-like execution mode. This methodology works very well for square matrices (e.g., m ≈ n), because load balance among compute units can be quickly achieved by the large number of parallel tiles along the anti-diagonal. However, for rectangular matrices, especially when m ≪ n, such a methodology may lose efficiency, since there are not sufficient tiles in most anti-diagonals.

6.2.3 Compensation-based Solutions and Their Limitations

In recent years, several studies [87, 76, 62] have offered another type of solution for parallelizing wavefront loops. Overall, the computation is conducted in a row-by-row manner. The contiguous data entries in a row are divided into groups and scheduled to different compute units. Figure 6.4 shows the three main steps of processing the bottom row: (1) each compute unit ignores the horizontal data dependency and computes its data entries in parallel to generate intermediate results; (2) a compensation step is performed to compute the ignored data for each entry in a scan-like process; (3) each compute unit corrects the intermediate results with the compensations to obtain the final results. Each step of this method can be parallelism-friendly, and it is also easy to balance entries between compute units.

1: Ignore horiz. dep. 2: Compensate the partial results 3: Combine the results of 2 & 1

Figure 6.4: Compensation-based solutions decompose the processing into three steps, each of which can be parallelism-friendly and load balanced.

However, this solution requires multiple expensive global synchronizations within and between the processing of each row. Within a row, in step 2, after a compute unit finishes its local computation on the ignored data, it has to wait for all preceding compute units to finish in order to get their compensation results, because the data dependency is propagated from the start to the end along a row. Between rows, only after the third step finishes at all compute units can they continue processing the next row, to respect the data dependency between rows. Thus, the performance might deteriorate without a highly optimized compensation design. More importantly, previous research illustrates that compensation-based parallelism works well for string-matching operators, e.g., max and +, but its generality to other domains is still unclear. As a consequence, in this chapter we will determine the boundary of the compensation-based method and answer the question: under which circumstances can the compensation-based method be used to optimize wavefront loops?

6.3 Compensation-based Computation – Theory

6.3.1 Standard Wavefront Computation Pattern

We define our target wavefront computation pattern by capturing the key operations and formalizing their data dependencies.

Definition 1. (Wavefront Pattern) Let A = (Ai,j) be an m by n matrix that stores the output of the wavefront computation. For any entry Ai,j, where 0 < i < m and 0 < j < n, its computations with its neighbors, e.g., Ai,j−1, are defined by applying two generic binary operators ⊕ and ◦ as shown in Equation I. Besides, a constant or variable value b can be applied with the operator ◦. Note that when i = 0 or j = 0, Ai,j can be predefined according to application-specific rules.

$$A_{i,j} = (A_{i,j-1} \circ b_0) \oplus (A_{i-1,j} \circ b_1) \oplus (A_{i-1,j-1} \circ b_2) \quad \text{(I)}$$

In the definition, ◦ is the distribution operator and ⊕ is the accumulation operator; we will use these symbolic representations in the proof. This abstracted definition can cover various real-world wavefront loops by transforming the generic operators into concrete ones. The practical cases will be discussed in § 6.4.

6.3.2 Compensation-based Computation Pattern

We can present the compensation-based computation pattern with the generic operators in Equation I. First, the data dependencies along the j-direction are ignored and the partial results are represented as Ãi,j, leading to Equation II-1. Second, an additional compensation step, Equation II-2, is carried out to produce the compensation values Bi,j. Third, the compensation values are used to correct the partial results Ãi,j for the loss caused by the loosened dependencies, as shown in Equation II-3. The symbols $\bigcirc$ and $\bigoplus$ represent the iterative application of the binary operators ◦ and ⊕, respectively.

˜ Ai,j =(Ai−1,j ◦ b1)  (Ai−1,j−1 ◦ b2) (II-1) 107

⎧ ⎨⎪ j−1 ˜ j−1 (Ai,u ◦ b0) when ◦ =  = u=0 v=u Bi,j ⎪ (II-2) ⎩ j−1 ( ˜  ) ◦ =  u=0 Ai,u b0 when ˜ Ai,j = Ai,j  Bi,j (II-3)
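As an illustration only (under the same assumptions as the previous sketch), the three-step decomposition of one row can be written as follows for the case $\circ \neq \oplus$. The helper dist_iter, which applies $\circ$ with $b_0$ a given number of times, is an assumed name and not part of the formal pattern.

#include <vector>
#include <functional>

// A minimal sketch of Equations II-1..II-3 for one row i (case dist != acc),
// assuming row i-1 of A and entry A[i][0] are already final. `dist_iter(x, b0, k)`
// applies `dist` with b0 to x exactly k times, i.e., the iterated operator in II-2.
template <typename T>
void compensate_row(std::vector<std::vector<T>>& A, size_t i,
                    std::function<T(T, T)> acc, std::function<T(T, T)> dist,
                    std::function<T(T, T, int)> dist_iter, T b0, T b1, T b2) {
  const size_t n = A[i].size();
  std::vector<T> At(n), B(n);
  At[0] = A[i][0];  // the boundary entry is predefined
  // Step 1 (II-1): ignore the horizontal dependency (independent over j).
  for (size_t j = 1; j < n; ++j)
    At[j] = acc(dist(A[i - 1][j], b1), dist(A[i - 1][j - 1], b2));
  // Step 2 (II-2): compensation values, written here in the O(n)-per-entry form.
  for (size_t j = 1; j < n; ++j) {
    B[j] = dist_iter(At[0], b0, static_cast<int>(j));
    for (size_t u = 1; u < j; ++u)
      B[j] = acc(B[j], dist_iter(At[u], b0, static_cast<int>(j - u)));
  }
  // Step 3 (II-3): correct the partial results (independent over j).
  for (size_t j = 1; j < n; ++j)
    A[i][j] = acc(At[j], B[j]);
}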

Obviously, this new pattern changes the computation ordering of Equation I. Thus, to show its validity, we need to prove under which circumstances Equation II-3 (together with Equation II-1 and Equation II-2) is equivalent to Equation I.

Theorem 1. The compensation-based computation shown in Equation II-3 (incl. Equation II-1 and II-2) is equivalent to the original computation in Equation I, provided that the binary operator $\oplus$ is associative and commutative, and (1) $\circ$ has the distributive property over $\oplus$, or (2) $\circ$ is the same as $\oplus$ (where, for brevity, we only use $\oplus$).

Proof. We use induction to prove the equivalence of the two equations. First, we focus on a base case to prove that the statement holds for updating the first element $A_{1,1}$. Starting from Equation II-3, we have $A_{1,1} = \tilde{A}_{1,1} \oplus B_{1,1}$. According to Equation II-2, and since $A_{1,0}$ is predefined, the item $B_{1,1} = \tilde{A}_{1,0} \circ b_0 = A_{1,0} \circ b_0$, no matter whether $\circ$ is the same as $\oplus$ or not. Then, putting Equation II-1 and Equation II-2 into Equation II-3, we get $A_{1,1} = (A_{0,1} \circ b_1) \oplus (A_{0,0} \circ b_2) \oplus (A_{1,0} \circ b_0)$. Since $\oplus$ has the commutative property, this is equal to $(A_{1,0} \circ b_0) \oplus (A_{0,1} \circ b_1) \oplus (A_{0,0} \circ b_2)$, which is $A_{1,1}$ defined by Equation I. Thus the statement is true for the base case.

Then, we focus on the inductive step: if the statement holds for $j = k - 1$, then it also holds for $j = k$. In the case of $\circ \neq \oplus$, based on Equation II-2, we know $B_{i,k} = \bigoplus_{u=0}^{k-1} \left( \tilde{A}_{i,u} \circ \bigcirc_{v=u}^{k-1} b_0 \right)$. We unfold it to get:

$$B_{i,k} = \bigoplus_{u=0}^{k-2} \left( \tilde{A}_{i,u} \circ \bigcirc_{v=u}^{k-1} b_0 \right) \oplus \left( \tilde{A}_{i,k-1} \circ b_0 \right) \tag{6.1}$$

Since $\oplus$ has the commutative and associative properties, this can be transformed to:

$$B_{i,k} = \left( \tilde{A}_{i,k-1} \circ b_0 \right) \oplus \left( \bigoplus_{u=0}^{k-2} \left( \tilde{A}_{i,u} \circ \bigcirc_{v=u}^{k-1} b_0 \right) \right) \tag{6.2}$$

Since $\circ$ has the distributive property over $\oplus$, we can "factor out" a $b_0$ from each term and get:

$$B_{i,k} = \left( \tilde{A}_{i,k-1} \oplus \left( \bigoplus_{u=0}^{k-2} \left( \tilde{A}_{i,u} \circ \bigcirc_{v=u}^{k-2} b_0 \right) \right) \right) \circ b_0 \tag{6.3}$$

Using Equation II-2 when $j = k - 1$, Equation 6.3 can be simplified to:

$$B_{i,k} = \left( \tilde{A}_{i,k-1} \oplus B_{i,k-1} \right) \circ b_0 \tag{6.4}$$

Because the induction hypothesis for $j = k - 1$ holds, meaning $A_{i,k-1} = \tilde{A}_{i,k-1} \oplus B_{i,k-1}$ is true, we get:

$$B_{i,k} = A_{i,k-1} \circ b_0 \tag{6.5}$$

Then, putting Equation II-1 and Equation 6.5 into Equation II-3, we get $A_{i,k} = (A_{i-1,k} \circ b_1) \oplus (A_{i-1,k-1} \circ b_2) \oplus (A_{i,k-1} \circ b_0)$. Due to the commutative property of $\oplus$, this is equal to Equation I. Therefore, the statement also holds for $j = k$ in the case of $\circ \neq \oplus$.

Now, we consider the case of $\circ = \oplus$, where $B_{i,k} = \bigoplus_{u=0}^{k-1} \left( \tilde{A}_{i,u} \oplus b_0 \right)$. Then, due to the associative and commutative properties of $\oplus$, $B_{i,k}$ can be transformed to:

$$B_{i,k} = \left( \tilde{A}_{i,k-1} \oplus \left( \bigoplus_{u=0}^{k-2} \left( \tilde{A}_{i,u} \oplus b_0 \right) \right) \right) \oplus b_0 \tag{6.6}$$

Equation 6.6 can be simplified by using Equation II-2 when $j = k - 1$:

$$B_{i,k} = \left( \tilde{A}_{i,k-1} \oplus B_{i,k-1} \right) \oplus b_0 \tag{6.7}$$

Using the induction hypothesis that $j = k - 1$ holds, we get $B_{i,k} = A_{i,k-1} \circ b_0$. Then, similar to the case of $\circ \neq \oplus$, we can prove that Equation II-3 is equal to Equation I for the case of $\circ = \oplus$ at $j = k$. Since both the base case and the inductive step have been established, the statement holds for all natural numbers $j$.

Now, we compare the complexity of the proposed compensation-based method with that of the original one. Obviously, the key difference between the two methods lies in how to satisfy the dependencies along the j-direction. In the original method, it is done by $A_{i,j-1} \circ b_0$ with $O(1)$ complexity, while in the proposed method, Equation II-2 is used for the same purpose, leading to $O(n)$ complexity. Nevertheless, Equation II-2 can also be optimized to $O(1)$ per entry by using dynamic programming techniques, i.e., $B_{i,j} = (B_{i,j-1} \circ b_0) \oplus (\tilde{A}_{i,j-1} \circ b_0)$ for $\circ \neq \oplus$, or $B_{i,j} = B_{i,j-1} \oplus (\tilde{A}_{i,j-1} \circ b_0)$ for $\circ = \oplus$. However, updating $B_{i,j}$ is still more expensive than the original $A_{i,j-1} \circ b_0$ operation. We will show that the proposed method exposes more parallelism in § 6.5, and thus provides better performance in § 6.6.
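The dynamic-programming form of the compensation step can be sketched as follows (case $\circ \neq \oplus$). The function name and the generic functor parameters are illustrative; the recurrence itself mirrors the formula above.

#include <cstddef>
#include <vector>

// O(1)-per-entry update of the compensation values B for one row,
// for the case dist != acc (e.g., acc = +, dist = *).
template <typename T, typename Acc, typename Dist>
void update_B_row(const std::vector<T>& At, std::vector<T>& B,
                  Acc acc, Dist dist, T b0) {
  B[1] = dist(At[0], b0);                  // B_{i,1} from the predefined A_{i,0}
  for (size_t j = 2; j < At.size(); ++j)   // each entry reuses B[j-1]
    B[j] = acc(dist(B[j - 1], b0), dist(At[j - 1], b0));
}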

6.4 Compensation-based Computation – Practice

In this section, we discuss four representative wavefront loops from real-world applications. In addition, we demonstrate whether and how these loops can be expressed in the compensation-based parallelism patterns shown in § 6.3.

SOR Solver (SOR) [58]: The method of successive over-relaxation (SOR) conducts a stencil-like computation to solve a linear system of equations in an iterative fashion. As shown below, A[i][j] represents a discrete gridpoint and its new value depends on its neighbors: some are the most recently updated (i.e., A[i][j-1], A[i-1][j]), while others are from the previous time step (i.e., A[i][j], A[i+1][j], A[i][j+1]), resulting in a wavefront computation pattern.

A[i][j] = (A[i][j] + A[i][j-1] + A[i-1][j] + A[i+1][j] + A[i][j+1]) / 5;

To express the computation in the compensation-based parallelism pattern of § 6.3, we map $(\oplus, \circ)$ to $(+, \cdot)$ and $b_0$, $b_1$, $b_2$ to 0.2. Obviously, the operators $+$ and $\cdot$ satisfy the requirements of Theorem 1.

Smith-Waterman (SW) [140]: It is a well-known algorithm to align the input sequences a and b. A[i][j] stores the maximum score for aligning the sub-sequences 0-i of a and 0-j of b. The s(i,j) is the substitution function (i.e., b2) that checks whether the corresponding amino acids are the same at position i of a and position j of b. The constant 2 is the insertion/deletion penalty (i.e., b0 and b1). For the operators, we map $(\oplus, \circ)$ to $(\max, +)$. Note, max is a binary operator, but for brevity, we put four operands in this form.

A[i][j] = max(A[i][j-1] - 2, A[i-1][j] - 2, A[i-1][j-1] + s(i,j), 0);
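To show how SW fits the compensation pattern concretely, the following sketch updates one SW row with $(\oplus, \circ) = (\max, +)$, $b_0 = b_1 = -2$, and $b_2 = s(i,j)$; the function name and the per-row substitution-score vector s are hypothetical conveniences, not the dissertation's GPU code.

#include <algorithm>
#include <vector>

// One SW row in compensation form: step 1 ignores A[i][j-1], step 2 builds
// B[j] = max_{u<j}(At[u] - 2*(j-u)) via an O(1)-per-entry recurrence, and
// step 3 combines. `prev` is row i-1 of A; `s[j]` is s(i,j) for this row.
void sw_row_compensated(const std::vector<int>& prev, std::vector<int>& cur,
                        const std::vector<int>& s) {
  const size_t n = cur.size();
  std::vector<int> At(n, 0), B(n, 0);
  for (size_t j = 1; j < n; ++j)                      // step 1 (II-1)
    At[j] = std::max({prev[j] - 2, prev[j - 1] + s[j], 0});
  B[1] = At[0] - 2;                                   // step 2 (II-2)
  for (size_t j = 2; j < n; ++j) B[j] = std::max(B[j - 1], At[j - 1]) - 2;
  for (size_t j = 1; j < n; ++j) cur[j] = std::max(At[j], B[j]);  // step 3
}

Note that + distributes over max, so condition (1) of Theorem 1 is satisfied for this mapping.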

Summed-area Table (SAT) [118]: It is used to accelerate texture filtering in image processing, where A[i][j] stores the sum of all pixels above and to the left of the point (i,j). Thus, p[i][j] is the pixel value (i.e., b2). In addition, the operator $\circ$ is equal to $\oplus$, which is $+$; b0 and b1 are both 0. Note that the computation order in the compensation-based method discussed in the previous section is along the j-direction. In this case, all entries at row i-1 have been updated when processing row i. As a result, the negation on A[i-1][j-1] will not affect the correctness.

A[i][j] = p[i][j] + A[i][j-1] + A[i-1][j] - A[i-1][j-1];

Integral Histogram (IHist) [125]: It extends the SAT and enables multi-scale histogram-based search. In this method, A[i][j] is the histogram position for a bin z of its top-left sub-image, and thus, Q(i,j,z) checks whether the pixel (i,j) belongs to bin z or not (i.e., b2). The other parts are similar to SAT: $\circ$ is equal to $\oplus$, which is $+$; b0 and b1 are both 0.

A[i][j] = A[i][j-1] + A[i-1][j] - A[i-1][j-1] + Q(i,j,z);

6.5 Design and Implementation on GPUs

This section presents our efficient design of the compensation-based parallelism on GPUs. Because Equation II-1 and Equation II-3 naturally present no dependencies between neighboring entries and are easy to parallelize, we focus on Equation II-2, the compensation step.

6.5.1 Compensation-based Computation on GPUs

For the compensation step, based on Equation II-2, we can transform the computation into a fixed number of operations, i.e., $B_{i,j} = (B_{i,j-1} \circ b_0) \oplus (\tilde{A}_{i,j-1} \circ b_0)$ for the case of $\circ \neq \oplus$, and $B_{i,j} = B_{i,j-1} \oplus (\tilde{A}_{i,j-1} \circ b_0)$ for the case of $\circ = \oplus$. This method is used by previous research [87]. However, it makes every cell in B depend on all preceding entries, causing strong data dependencies. Besides, the two formulas indicate different parallelization strategies, setting obstacles for the implementations on GPUs.

Therefore, we propose the efficient scan- and weighted scan-based methods to process the compensation computation. For the case of $\circ = \oplus$, because all previous $\tilde{A}$ contribute equally to the current B, each of them only needs to add a single $b_0$ with the operator $\circ$. For example, to calculate $B_{1,3}$, $\tilde{A}_{1,0} \circ b_0$, $\tilde{A}_{1,1} \circ b_0$, and $\tilde{A}_{1,2} \circ b_0$ are used, and each of them only needs a single $b_0$ without considering the index. This actually corresponds to a typical scan operation, which has been well understood on GPUs [163]. However, the case $\circ \neq \oplus$ is much more complicated, because each previous $\tilde{A}$ has a different impact on the current B. For example, to calculate $B_{1,3}$, $\tilde{A}_{1,0} \circ 3b_0$, $\tilde{A}_{1,1} \circ 2b_0$, and $\tilde{A}_{1,2} \circ b_0$ are applied.² Thus, the index (or distance) information of each operand has to be considered. We call this the weighted scan pattern, and its parallel design is shown in Figure 6.5.


Figure 6.5: Parallel design of the weighted scan-based compensation computation. The operands lhs and rhs represent the left-hand side and the right-hand side of the $\oplus$ operator.

In the design, we conduct the compensation computation in two stages. (1) The weighted scan. This is an iterative computation that considers the effect of each preceding operand lhs on the current operand rhs. Suppose the size of the input array $\tilde{A}$ is n; we need $\log_2 n$ steps (from $i = \log_2 n - 1$ down to 0) to finish the weighted scan. In each step, an entry lhs contributes the weight $(2^i)b_0$ to the entry rhs at distance $2^i$. For example, as shown in the figure, when the input array size is 4, there are 2 steps in the weighted scan. In the step $i = 1$, the lhs operand $\tilde{A}_0$ contributes $(2^1)b_0$ to the rhs operand $\tilde{A}_2$, leading to $(\tilde{A}_0 \circ (2^1)b_0) \oplus \tilde{A}_2$ at the position of $\tilde{A}_2$. Note, if the position of rhs is less than the current distance $2^i$, the lhs does not need to contribute anything to rhs. (2) The weighted shift. According to Equation II-2, $B_{i,j}$ stores only the summation of the previous weighted $\tilde{A}_{i,u}$ with u up to j-1. Thus we need to adjust the previous results from the weighted scan to eliminate the effect of the current operand. This can be achieved by shifting each item and adding an additional weight $b_0$, as shown in the right part of Figure 6.5.
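For reference, a minimal sequential sketch of the two stages for $(\oplus, \circ) = (+, \cdot)$ is given below; it is host-side illustration code, not the GPU kernel, and the function name weighted_scan_shift is an assumption. The sketch applies the power-of-two offsets in increasing order (the common Hillis-Steele formulation), whereas the figure and kernel step through the distances from largest to smallest.

#include <cmath>
#include <vector>

// Sequential sketch of the compensation step for (acc, dist) = (+, *):
// after the weighted scan, scan[j] = sum_{u<=j} At[u]*w^(j-u); the weighted
// shift then yields B[j] = sum_{u<j} At[u]*w^(j-u), matching Equation II-2.
std::vector<double> weighted_scan_shift(const std::vector<double>& At, double w) {
  const size_t n = At.size();
  std::vector<double> B(n, 0.0);
  if (n < 2) return B;
  std::vector<double> scan(At);
  // Stage 1: weighted scan with power-of-two offsets.
  for (size_t d = 1; d < n; d <<= 1)
    for (size_t j = n - 1; j >= d; --j)
      scan[j] += scan[j - d] * std::pow(w, static_cast<double>(d));
  // Stage 2: weighted shift, i.e., drop the current operand and add one more w.
  for (size_t j = 1; j < n; ++j) B[j] = scan[j - 1] * w;
  return B;
}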

²In the following sections, for brevity, the coefficient before $b_0$ means the number of $\circ$ operations on $b_0$; e.g., $2b_0$ equals $\bigcirc_{k=1}^{2} b_0$.

In order to better fit the underlying architecture of GPUs, our implementation carries out warp-aware SIMD computation by explicitly operating on data at the register level. Algorithm 6.2 shows our weighted scan GPU kernel for the operator $(+, \cdot)$. The function blk_wscan in line 2 processes data assigned to each block. First, we load the current data into rhs (lines 8-9), followed by a series of warp-level shuffle operations to realize the intra-warp weighted scan (lines 11-15) using the scaling weight w with the distance i (line 13). The intermediate results are stored in shared memory. Then, a single thread handles the inter-warp weighted scan over the intermediate results, where the weight grows by the distance of a WRP_SIZE (lines 18-26). Finally, we broadcast the values in shared memory (representing the effects from the preceding warps) to the local value in the current warp; still, the weight should be scaled up with the local index (line 29). Note, in line 31, f determines whether an additional weight w is needed to modify the current value (corresponding to the aforementioned weighted shift). The function cpst_based_comput in line 34 is a recursive function to deal with data exceeding a block size, and the base case is identified in line 39. The function blk_wreduce is a variant of blk_wscan that performs weighted reduction operations over the data of each block, whose intermediate results are stored in part_d. Then, we recursively call the weighted scan to process part_d using the weight with the distance of BLK_SIZE (line 45). At last, blk_wscan is used to carry out the intra-block weighted prefix sum (line 47).

Algorithm 6.2 Weighted prefix sum for the operator (+, ·)

1  __global__ // BLK_SIZE: block size; WRP_SIZE: warp size
2  void blk_wscan(float *in, float *out, int n,
3                 float w, float *partial, bool f) {
4    __shared__ float smem[BLK_SIZE/WRP_SIZE];
5    // gid: global idx; tid: thread idx; bid: block idx;
6    // lid: lane idx; wid: warp idx;
7    float rhs, lhs;
8    if(tid == 0) rhs = partial[bid];
9    else rhs = (gid < n)? in[gid] : 0;
10   /* Intra-warp weighted prefix-sum */
11   for(int i = (WRP_SIZE>>1); i >= 1; i >>= 1) {
12     lhs = __shfl_up(rhs, i);
13     if(lid >= i) rhs = lhs*__powf(w, i) + rhs;
14   }
15   if(lid == WRP_SIZE-1) smem[wid] = rhs;
16   /* Inter-warp weighted prefix-sum */
17   __syncthreads();
18   if(tid == 0) {
19     float lhs2 = smem[0]*w, rhs2;
20     smem[0] = partial[0];
21     for(int i = 1; i < BLK_SIZE/WRP_SIZE; i++) {
22       rhs2 = smem[i];
23       smem[i] = smem[i-1]*__powf(w, WRP_SIZE) + lhs2;
24       lhs2 = rhs2*w;
25     }
26   }
27   __syncthreads();
28   rhs = rhs +
29     ((!lid)?smem[wid]:smem[wid]*__powf(w, lid));
30   /* Modification */
31   if(f) if(tid != 0) rhs *= w;
32   if(gid < n) out[gid] = rhs;
33 }
34 void cpst_based_comput(float *in, float *out, int n,
35                        float base, float w, bool f=true) {
36   dim3 blks(BLK_SIZE, 1, 1);
37   dim3 grds(CEIL_DIV(n, BLK_SIZE), 1, 1);
38   // Device malloc part_d for partial results
39   if(grds.x == 1) {
40     // copy base to part_d
41     blk_wscan<<<grds, blks>>>(in, out, n, w, part_d, f);
42     return;
43   }
44   blk_wreduce<<<grds, blks>>>(in, out, n, w, part_d, f);
45   cpst_based_comput(part_d, part_d, grds.x,
46                     base, pow(w, BLK_SIZE), false);
47   blk_wscan<<<grds, blks>>>(in, out, n, w, part_d, f);
48 }

One can also implement the weighted scan-based compensation method on GPUs by leveraging the scan functions from GPU libraries, e.g., Thrust and ModernGPU. The appendix shows how to prepare the corresponding customized comparators for the library-based scan functions, which will be used as one of the baselines in the evaluation.

6.5.2 Synchronizations on GPUs: Global vs. P2P

As discussed in § 6.2.3, the compensation-based method may encounter performance degradation due to the synchronizations within and between rows. The weighted scan method can mostly mitigate the synchronization overhead within a row; between rows, however, the synchronization still affects the performance, because existing compensation-based solutions [62, 87] schedule thread blocks in a row-by-row manner, and each thread block has to wait for all others to finish before processing the next row. We call this scheduling method the global synchronization. In contrast, the tile-based solutions [24, 110] use a pipeline-like mode: the tiles in a same row are assigned to one thread block and processed sequentially, which guarantees the horizontal dependencies; when the computation on a tile finishes, it triggers the processing of the tiles below through a lightweight peer-to-peer (P2P) synchronization, e.g., spin locks, which guarantees the vertical dependencies.

Figure 6.6 exhibits these two methods solving Algorithm 6.1 over different working matrices by varying their dimensions m and n. When m and n are close to each other, sufficient tile-level parallelism can be exposed, since there are many parallel tiles along the anti-diagonal. Thus, in the tile-based method, the poor performance of sequential processing at the beginning and the ending diagonals can be effectively hidden. In this scenario, i.e., when m and n are close, processing the matrix row by row would cause high overhead due to the frequent global synchronizations. This is shown in the right part of Figure 6.6. On the other hand, when m is much larger than n, the portion of the beginning and ending diagonals is not negligible in the anti-diagonal-major method; in the extreme case, the whole computation would be serialized. For example, if each row/anti-diagonal only contains a single tile, no tiles can be processed in parallel. In this scenario, as shown in the left part of Figure 6.6, the row-major method with the global synchronization can provide much better performance.

Figure 6.6: Performance comparison between the row-major computation with global sync. and the anti-diagonal-major computation with peer-to-peer (p2p) sync. The row-major kernel is based on our weighted scan-based method, while the diagonal-major kernel is based on a tiled solution [24].
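For illustration only, the following CUDA fragment sketches the spin-lock style P2P synchronization between vertically adjacent tiles; the ready-flag array, function names, and one-flag-per-tile-column layout are assumptions, not the dissertation's implementation.

// Each tile column keeps a "completed rows" counter in global memory.
// A block spins until the tile directly above has been published, processes
// its own tile, and then publishes its row.
__device__ void wait_for_tile_above(volatile int* ready, int tile_col, int row) {
  if (threadIdx.x == 0)
    while (ready[tile_col] < row) { /* spin: tile above not yet finished */ }
  __syncthreads();  // all threads of the block proceed together
}

__device__ void publish_tile(volatile int* ready, int tile_col, int row) {
  __syncthreads();       // make sure every thread finished writing its results
  if (threadIdx.x == 0) {
    __threadfence();     // make the tile's global-memory writes visible first
    ready[tile_col] = row + 1;
  }
}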

6.5.3 Putting Them All Together

As discussed in the previous subsection, there is no single scheduling and synchronization method that fits all workload scenarios. Therefore, we propose a two-level hybrid method for wavefront problems, as shown in Figure 6.7. We use the compensation-based computation with global synchronization for matrices that expose sufficient parallelism within each row and whose number of rows is limited enough to bound the overhead of the global synchronization; otherwise, we switch to a tile-and-compensation hybrid method that organizes data into tiles and utilizes P2P synchronization between tiles, while inside each tile the compensation-based method is still used to accelerate the computation. To find the optimal switching points, we build a simple offline auto-tuner based on logistic regression to learn how the software and hardware configuration factors, including the operational intensity of each entry, the m and n of the working matrix, and the generation of GPUs, determine the switch point. In the evaluation, we take 200 combinations of factors to fit the likelihood function offline.
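As a sketch of how such a trained model could be consulted at run time, the function below evaluates a logistic decision over the features named in the text; the feature scaling and the weight vector w are placeholders (the actual coefficients would come from the offline training), and the function name is hypothetical.

#include <cmath>

// Illustrative logistic-regression switch-point decision.
bool use_p2p_sync(double m, double n, double op_intensity, double gpu_gen,
                  const double w[5] /* bias + 4 feature weights */) {
  double z = w[0] + w[1] * std::log2(m) + w[2] * std::log2(n) +
             w[3] * op_intensity + w[4] * gpu_gen;
  double p = 1.0 / (1.0 + std::exp(-z));   // logistic (sigmoid) function
  return p > 0.5;  // true: tiles + P2P sync; false: row-major + global sync
}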


Figure 6.7: Proposed hybrid method to adapt the computation and synchronization to different wavefront problems and workspace matrices.

6.5.4 Library-based Implementations

It is invalid to use the 'address-of' operator (&) for indexing on GPUs, since the data are explicitly controlled in memory hierarchies with different address spaces. Moreover, restricted by the interface of comparators, we can only define the behavior on two given operands rather than the scan itself. Based on these constraints, we implement the custom comparator in Algorithm 6.3. A new data structure concat_t is introduced to associate the original value v, its index i, and the flag f that marks whether its distance needs modification (weighted shift). The comparator is shown from line 6: k is the distance between the two operands and is used to add weights to lhsv in line 11 or 14. f makes sure the weight is modified if the weighted shift occurs (e.g., line 11). The branch in line 10 guarantees that lhsv always precedes rhsv.

Algorithm 6.3 Custom comparator in library-based solution for Algorithm 6.1, where the operator combination is (+, ·) and the weight is 0.5.

1  typedef struct {
2    float v; int i; bool f=false;
3  } concat_t;
4  template <typename T> struct bin_opt {
5    __device__
6    T operator()(const T &lhs, const T &rhs) const {
7      int k = abs(rhs.i - lhs.i);
8      float lhsv, rhsv;
9      int idx = max(lhs.i, rhs.i);
10     if(lhs.i < rhs.i) {
11       lhsv = lhs.v * __powf(0.5, lhs.f?k:k+1);
12       rhsv = rhs.v * (rhs.f?1:0.5);
13     } else {
14       lhsv = rhs.v * __powf(0.5, rhs.f?k:k+1);
15       rhsv = lhs.v * (lhs.f?1:0.5);
16     }
17     T res(lhsv+rhsv, idx, true);
18     return res;
19 }};

6.6 Evaluation

We conduct the experiments on two generations of NVIDIA GPUs, i.e., Tesla K80 and P100. The specifications are listed in Table 6.1. First, we evaluate the performance of the core kernel in the compensation-based solution. Then, we investigate how the tile sizes affect the performance of our hybrid method. Finally, we report on the performance of the wavefront problems solved by our method compared with state-of-the-art optimizations.

Table 6.1: Experiment Testbeds

                      Tesla K80-Kepler     Tesla P100-Pascal
Cores                 2496 @ 824 MHz       3584 @ 405 MHz
Multiprocessors (MP)  13                   56
Reg/Smem per MP       256/48 KB            256/48 KB
Global memory         12 GB @ 240 GB/s     12 GB @ 720 GB/s
Software              CUDA 7.5             CUDA 8.0

6.6.1 Performance of Compensation-based Kernels

We first study the weighted scan performance in the compensation-based solution by comparing our own design in Algorithm 6.2 with the scan functions based on Thrust and ModernGPU. The customized comparators for the library-based solutions are shown in the appendix. As indicated by the four wavefront problems in § 6.4, we only need three combinations of binary operators, i.e., $(+, \cdot)$, $(\max, +)$, and $(+, +)$. The input matrices have random sizes of m and n varying from $2^{14}$ to $2^{28}$.

Figure 6.8: Throughput comparison of the weighted scan kernels.

As shown in Figure 6.8, for the case of $\circ \neq \oplus$, i.e., $(+, \cdot)$ and $(\max, +)$, our design yields significant performance improvements over Thrust and ModernGPU, achieving an average of 3.5x (4.7x) and 2.4x (3.9x) speedups on K80 (P100). The implementation of ModernGPU explicitly exploits GPU registers for data reuse and permutation.³ However, our implementations not only take advantage of GPU registers, but also improve the performance for the following two reasons: (1) we calculate the distance-related weights more efficiently within the kernels, while the library-based methods put all the checking and calculation in the comparators, leading to redundant computations; (2) our algorithm directly operates on the original data and keeps track of their locations in the GPU kernels, while the library-based design has to pack and unpack such information before and after the actual computation, causing an extra performance penalty.

For the case of $\circ = \oplus$, i.e., $(+, +)$, where there is no need to deal with the varied weights, our solution falls back to a typical scan algorithm and achieves comparable performance to the highly-optimized library codes. We also observe that our design is particularly effective for the middle range of input sizes. For example, it delivers an average of 5.5x improvement for inputs ranging from $2^{16}$ to $2^{22}$ on P100. This is due to the different parallel strategies. In ModernGPU, each thread is "coarsened" to handle multiple data elements to better utilize the on-chip memory. However, this might result in degraded GPU occupancy, i.e., fewer threads running in a multiprocessor (MP) to hide memory latency for middle-sized inputs. In contrast, considering the potentially heavy use of registers for the weight computation, we schedule a thread to process one element at a time. Besides, this thread-data scheduling strategy can also avoid uncoalesced memory transactions.

³Since ModernGPU is an open-source library, our analysis of the library-based solutions is mainly based on it.

6.6.2 Performance of Hybrid Kernels

Optimal Tile Sizes

Our hybrid method conducts compensation-based computation in tiles when sufficient parallelism is available along the anti-diagonals. Thus we first investigate how the tile sizes influence the performance. In the experiment, a large square matrix of $2^{15} \times 2^{15}$ is used to represent the case with sufficient anti-diagonal parallelism. In addition, we maximize the shared memory usage (40 KB for each block and another 8 KB for the inter-warp weighted prefix-sum) by using the persistent thread block mechanism [24, 160], where each MP only hosts one thread block to avoid deadlock for the spin-lock in the P2P synchronization. The tile sizes (height * width) are shown in Figure 6.9. The width corresponds to the thread block size, meaning the threads will perform the row-major compensation-based computation. In Figure 6.9, we observe that our hybrid method prefers rectangular tiles, because they allow more threads to handle entries in parallel and the registers and shared memory for intermediate values can be utilized more efficiently. For SOR and SW, the complex weight computation needs more thread warps per block to exploit the high parallelism and data reuse in registers. In contrast, for SAT and IHist, the computation is relatively simple and small blocks are sufficient. In the following experiments, we set the tile sizes to 10x1024 for SOR and SW on K80 and P100, and 20x512 (40x256) for SAT and IHist on K80 (P100). For the other tiling-based solutions, i.e., the tile and hypertile methods in § 6.6.2, similar tuning procedures are performed, and we select the best tile sizes for them (i.e., tiles of 80x128 or 40x256).

Figure 6.9: Performance of our hybrid method with varying tile sizes (height * width).

Comparison to Previous Work

Now, we evaluate the wavefront problems optimized by our method and state-of-the-art solutions. We fix the total size of the working matrix A to $2^{30}$ and vary its dimensions as shown in Figure 6.10. The dashed vertical lines mark the switching points in our hybrid method between the global and P2P synchronization, whose calculation is based on the auto-tuner presented in § 6.5.3. For the tiling-based methods, the tile kernel uses the original row-major data layout, while the hypertile kernel uses hyperplane tiles with an anti-diagonal-major layout via the affine transformation. Most of the codes can be found in previous research [24]. For the library-based strategy, lib-mgpu and lib-thrust are the compensation-based solutions with global synchronization using the ModernGPU and Thrust libraries, respectively.

Figure 6.10: Performance comparison of the library-based (lib-thrust, lib-mgpu), tiling-based (tile, hypertile), and our hybrid method on different input matrices (height * width). The transformation of data layouts in hypertile is also presented. The vertical dashed lines indicate the switch points in our method.

We first focus on the left parts of the vertical dashed lines, where our hybrid method uses the weighted scan with global synchronization. For SOR and SW, the library-based solutions achieve an average of 3.7x (4.3x) speedups over the tiling-based ones on K80 (P100), because a matrix (height * width) with a longer width exposes more parallelism in a row, while a shorter height places fewer demands on global synchronization. These scenarios cause severe serialization of the tiling-based solutions, which explains the drastic improvements from our solution of up to 22.1x (43.3x) speedups on K80 (P100). Compared to the library-based solutions, our design provides an average of 1.9x and 3.1x speedups on K80 and P100, respectively. This can mainly be attributed to our native support for the complex weight computation and the elimination of pack and unpack overhead, as discussed in § 6.6.1. For SAT and IHist, our design delivers significant speedups: up to 4.8x (6.5x) over the library-based solutions on K80 (P100) when the width of the input matrix falls into the range of $2^{16}$ to $2^{22}$, which is consistent with the results in § 6.6.1.

Then, we focus on the right parts of the dashed lines, where our hybrid method switches to the P2P synchronization. In these cases, the performance of the library-based solutions deteriorates significantly, as more expensive global synchronizations are required. In contrast, the tiling-based solutions exhibit superior performance, as the square-like matrices contain more parallel tiles along the anti-diagonals. Compared to the tile kernel, our solution takes advantage of both the row-major computation and the lightweight local synchronization, achieving an average of 3.6x (3.1x) speedups on K80 (P100). For the hypertile kernel, its computation is improved significantly for the square-like matrices due to the increased number of entries in each tile, which exposes more parallelism opportunities; however, the transformation overhead becomes non-negligible. Our method, by contrast, provides an average of 1.1x speedup on K80 if the transformation overhead is considered, and an average of 1.8x speedup on P100. Even if we only consider the computation part, our method can still yield an average of 1.3x speedup on P100.

6.6.3 Discussion

Precision: For the integer datatype, our compensation-based method obtains exactly the same results as the original methods, e.g., the tiling-based ones. However, we also need to consider precision for the float and double datatypes, because changing the computation order may lead to different rounding results. In the SOR experiments, we observe small relative errors of around $10^{-6}$ when the float datatype is used. The error can be further reduced to $10^{-8}$ for the double datatype. We believe this is acceptable for applications using the float and double datatypes.

Generality: Theorem 1 poses requirements on the operators $\circ$ and $\oplus$ along the horizontal dimension; however, in practice, the requirements can be loosened or differentiated along the vertical and diagonal dependencies (e.g., the negation in SAT from § 6.4). On the other hand, the proof demonstrates that the compensation-based method relies only on standard properties of binary operators. Therefore, it could benefit applications with more general data dependencies (e.g., FSM [85]) than the wavefront pattern, which only has dependencies with the horizontal, vertical, and anti-diagonal neighbors.

6.7 Chapter Summary

In this chapter, we target the compensation-based parallelism for wavefront loops on GPUs. We prove that, for the compensation-based method, if the accumulation operator is associative and commutative, and the distribution operator is either distributive over or the same as the accumulation operator, then breaking the data dependency and properly changing the sequence of computation operations will not affect the correctness of the results. We also propose a highly efficient design of the compensation-based parallelism on GPUs, which uses weighted scan-based GPU kernels to accelerate the computation and the tiling method to optimize data access and synchronization. Experiments demonstrate that our work achieves significant performance improvements for four wavefront problems on various input workloads.

Chapter 7

Data Reuse in Stencil Computations

7.1 Introduction

Spatial blocking is a critical memory-access optimization that seeks to put spatially reusable data in fast memory (e.g., L1 cache, scratchpad memory, or registers) before the actual computation. It has been proven to be effective in utilizing the parallel computing potential of modern accelerators, especially for stencil kernels, where the kernels perform the same computations and data-access patterns over each cell in a multi-dimensional grid. Extensive research efforts have been made to explore different blocking schemes and develop high-performance stencil programs [117, 61, 128, 129].

In stencil computations, each cell is visited multiple times by its neighbors as the computation sweeps over a spatial grid. Consequently, cache blocking should be done to avoid unnecessary off-chip DRAM loads. Besides global memory, modern GPUs come with multiple low-latency cache levels within each compute unit (CU): (1) L1 cache: hardware-managed cache; (2) scratchpad memory: fast, programmable memory that is shared by threads assigned to the same CU, but which developers must explicitly manage; and (3) registers: the fastest memory, which can be accessed by each thread. In addition, recent GPUs support data exchange between threads in the same wavefront [9] or warp [120]. Different spatial blocking techniques have been proposed for these caches.

By using scratchpad memory, one can explicitly load the requisite stencil data into the cache from global memory. Then, all the working threads synchronize before doing the actual computation. After the computation, the results are stored back to global memory [117]. With the regularity of the access pattern in stencils (based on Cartesian grids), simply relying on the L1 cache can also provide competitive performance [149, 111, 80]. That way, developers only need to focus on the workload partitioning and thread organization. In addition, the advent of register-based data exchange between threads enables each thread to load data into its individual registers and then directly communicate with the threads that own its neighboring data [61, 25, 73].

However, optimizing stencil kernels via spatial blocking introduces three major challenges. First, writing blocking code requires substantial coding effort, especially when using registers, as developers must handle the complex and convoluted data communication patterns amongst threads. For stencils with different dimensionalities, where communication patterns must change accordingly, developers must possess extensive coding expertise to reorganize the threads and recalculate the data exchange patterns. Second, different GPU architectures have different ISAs, specifications, and run-time configurations, all of which impact the communication patterns and, in turn, lead to rewriting of the stencil codes. For example, the sizes of the hardware scheduling unit (e.g., wavefront) and the data exchange instructions differ between AMD and NVIDIA GPUs, causing issues with code portability for the stencil kernels. Third, even when a selected stencil is mapped onto a selected GPU, the redesign of the kernel still requires changes in the target cache levels (e.g., scratchpad memory or registers) or blocking strategies (e.g., 2D, 2.5D, or 3D blocking schemes).

While existing stencil frameworks for parallel code generation and performance auto-tuning focus on mapping an entire stencil computation onto an accelerator with dedicated blocking optimizations [147, 112], we focus on a cross-platform framework called GPU-UNICACHE that automatically generates spatial blocking codes for different stencils, GPU architectures, and cache levels, while still allowing developers the option to change their desired stencils. That is, GPU-

UNICACHE analyzes the characteristic parameters of both stencils and GPUs as input and generates highly-optimized blocking codes for the designated cache level. For example, for register-based methods, the GPU-UNICACHE framework handles the distribution of grid data to minimize

register conflicts and realizes the communication patterns of given blocking strategies by minimizing the number of permute/shuffle instructions.

The contributions of our work include the following: (1) GPU-UNICACHE, a framework to automatically generate spatial blocking codes for stencil kernels on GPUs, and (2) a comprehensive

evaluation of the GPU-UNICACHE framework on AMD and NVIDIA GPUs. GPU-UNICACHE not only improves programming productivity by unifying the interfaces of spatial blocking for different stencils, GPU architectures, and cache levels; it also provides high performance by optimizing data distribution, indexing conversion, thread communication, and synchronization to facilitate data access in GPU kernels. Compared to hardware-managed memory (L1 cache), with single-precision arithmetic, our automatically-generated codes deliver up to 1.7-fold and 1.8-fold speedups at the scratchpad memory level and register level, respectively, when running on an AMD GCN3 GPU, and up to 1.6-fold and 1.8-fold, respectively, when running on an NVIDIA Maxwell GPU. For double precision, it delivers up to a 1.3-fold speedup on both GPU platforms. Compared to the state-of-the-art benchmarks (incl. Parboil [141], PolyBench [126], SHOC [54]), it can also provide up to a 1.5-fold improvement.

7.2 Motivation and Challenges

7.2.1 Stencil Computation

A stencil computation defines how the point p in a multi-dimensional grid at time t (stored in v) is updated based on a function f of the surrounding grid points P at the previous time step t-1 (stored in u). It sweeps the stencil computation over all the points at t before moving to the next time step t+1, and then the next. The stencil order h defines the distance between the central point p and its farthest neighbor $q \in P$. The stencil size N is $|\{p\} \cup P|$. Equation (7.1) shows a stencil computation pattern in a 2-dimensional (i.e., 2D) grid; its h is 1, and N is 5. For brevity, we refer to this stencil as "2D5Pt."

$$v_{i,j} = f(P) = a_0 u_{i-1,j} + a_1 u_{i+1,j} + a_2 u_{i,j} + a_3 u_{i,j-1} + a_4 u_{i,j+1} \tag{7.1}$$
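A plain, unblocked sweep of Equation (7.1) can be sketched as follows; the row-major layout, array names, and boundary handling (skipping the halo) are illustrative assumptions.

// One 2D5Pt time step without any cache blocking; u is the grid at time t-1
// and v at time t, both stored row-major with dimensions m x n.
void stencil_2d5pt(const float* u, float* v, int m, int n,
                   float a0, float a1, float a2, float a3, float a4) {
  for (int i = 1; i < m - 1; ++i)
    for (int j = 1; j < n - 1; ++j)
      v[i * n + j] = a0 * u[(i - 1) * n + j] + a1 * u[(i + 1) * n + j] +
                     a2 * u[i * n + j] +
                     a3 * u[i * n + (j - 1)] + a4 * u[i * n + (j + 1)];
}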

Due to the application-specific f, there exist no common libraries for stencils that users can directly use without defining the specific stencil patterns. Thus, to evaluate the potential benefits of our GPU-UNICACHE library framework, we collect a benchmark of stencils representing different dimensionalities and memory-access patterns, as noted in Table 7.1. Although we distinguish between low and high data-reuse kernels for each dimensionality, their arithmetic intensities (AI), defined as FLOPS/byte [155], are similar.¹ Additionally, the data-access patterns differ in that one-dimensional (i.e., 1D) stencils make unit-stride accesses, whereas higher-dimensional stencils make non-contiguous accesses of memory. Irrespective of the access pattern, if the data can be ideally cached and reused, the stencil computation will benefit with respect to performance.

Table 7.1: Summary of the stencil computations

Name     jacobi-1d [126]  gaussian X7 [177]  jacobi-2d [126]  seidel-2d [126]  heat-3d [126]  jacobi-3d [55]
Stencil  1D3Pt            1D7Pt              2D5Pt            2D9Pt            3D7Pt          3D27Pt
h        1                3                  1                1                1              1
N        3                7                  5                9                7              27
#FLOPS   5                13                 9                17               13             53
Bytes    12               28                 20               36               28             108
AI       0.42             0.46               0.45             0.47             0.46           0.49

7.2.2 Spatial Blocking Schemes

In spatial cache-blocking optimizations, one needs to load data into the cache and then do the stencil computation using the cached data before the results are stored back to global memory. Figure 7.1 shows examples of different blocking schemes. A 2D stencil can be optimized by using 2D tiles. Likewise, for 3D stencils, a 3D block is a natural way to buffer data for high reuse. Alternatively, one can use a 2D-slice layout, allowing stencil computations to be carried out from the bottom to the top (i.e., 2.5D blocking). In addition, temporal blocking [117], consisting of multiple rounds of spatial blocking within the cache, can also be used.

Figure 7.1 shows which data domains are loaded into the cache. However, when designing real GPU kernels, one must explore the implementation details, such as how to load the domain data.

¹We use the Roofline model with an emphasis on loading data from the memory of a machine model without cache.

Figure 7.1: Blocking schemes for 2D and 3D stencils (2D blocking with square or rectangular tiles, 2.5D blocking, and 3D blocking).

As shown in Figure 7.1, when loading a 2D-square tile, the task of loading the boundary points (Bs) outside the current tile is assigned to the B points rather than the C points. This method introduces branch divergence into the GPU kernels. Alternatively, with an (additional) amount of remapping calculation, the data can be evenly assigned to threads (not shown in the figure). In addition, when loading data, one must decide on either a square tile for high data reuse or a rectangular tile for more regular memory access. All of the above choices affect the later realization of fetching data from the caches, which in turn produces significant performance differences (as captured in Figure 7.2b).

On the other hand, temporal cache-blocking essentially adds another dimension (i.e., time) to the spatial blocking by conducting multiple rounds of computation over reusable data (loaded in cache). This procedure also follows a fixed or predictable pattern, which matches the idea of our

GPU-UNICACHE framework. However, considering that spatial blocking is more fundamental and essential among blocking techniques, we focus on analyzing the patterns in spatial blocking for stencils in this paper. Our idea is general and can be used to construct temporal blocking as reported in previous research [117, 129].
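To make the 2D blocking scheme concrete, the following CUDA sketch loads a square tile (including its halo of width h = 1) into scratchpad memory before computing a 2D5Pt update. The tile size, kernel name, halo-loading scheme, and zero-padding of out-of-grid cells are illustrative assumptions rather than the generated GPU-UNICACHE code.

#define TILE 16  // illustrative tile width/height; assumes blockDim == (TILE, TILE)

// Hand-written sketch of 2D blocking in scratchpad (shared) memory for 2D5Pt;
// each block loads a (TILE+2) x (TILE+2) tile, halo included.
__global__ void jacobi2d_tile(const float* u, float* v, int m, int n,
                              float a0, float a1, float a2, float a3, float a4) {
  __shared__ float s[TILE + 2][TILE + 2];
  int gx = blockIdx.x * TILE + threadIdx.x;  // global column of the center point
  int gy = blockIdx.y * TILE + threadIdx.y;  // global row of the center point
  // Cooperative load: every thread loads its center cell; threads near the
  // tile edge additionally load the halo cells.
  for (int dy = threadIdx.y; dy < TILE + 2; dy += TILE)
    for (int dx = threadIdx.x; dx < TILE + 2; dx += TILE) {
      int y = blockIdx.y * TILE + dy - 1, x = blockIdx.x * TILE + dx - 1;
      s[dy][dx] = (y >= 0 && y < m && x >= 0 && x < n) ? u[y * n + x] : 0.0f;
    }
  __syncthreads();
  if (gy >= 1 && gy < m - 1 && gx >= 1 && gx < n - 1) {
    int ly = threadIdx.y + 1, lx = threadIdx.x + 1;
    v[gy * n + gx] = a0 * s[ly - 1][lx] + a1 * s[ly + 1][lx] + a2 * s[ly][lx] +
                     a3 * s[ly][lx - 1] + a4 * s[ly][lx + 1];
  }
}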

7.2.3 Challenges

Performance It is critical to take advantage of the cached memory hierarchy in a GPU via blocking optimizations. Though modern GPUs provide different options, such as L1 cache, scratchpad memory, and registers, it is still unclear where data should be cached for different stencils. Figure 7.2a shows two types of stencils (i.e., jacobi-2d and jacobi-3d) that prefer different cache options on the same platform. On the other hand, for different blocking strategies, developers need to adjust the optimizations to achieve the best performance. Figure 7.2b shows the diversified performance of the "seidel-2d" stencil on two types of caches (i.e., LDS and registers), for each of which we use different loading styles. These simple examples illustrate the challenges encountered by programmers when implementing stencil codes on GPUs. They also demonstrate that choosing a "one-size-fits-all" optimization strategy for any kind of stencil or GPU would be ineffective.

(a) Diff Stencils/GPUs (b) Diff Variants

Figure 7.2: Diversified performance of stencils under different situations. BRC and CYC refer to different loading modes, while 1DWav and 2DWav mean different wave layouts (discussed in § 7.5).

Programmability The second challenge encountered by developers is the programmability issue. They might be involved in complex implementation details, where, for example, one needs to figure out how to efficiently organize domain data into individual registers across threads in a wave while using registers as cache. Many other factors can affect how GPU kernel codes are written, including stencil types, GPU architectures, and blocking strategies. To address these issues, we present a framework called GPU-UNICACHE to automatically generate spatial-blocking codes that manage data reuse within a GPU.

7.3 GPU-UniCache Framework

Figure 7.3 highlights the major components of our GPU-UNICACHE framework: (1) feature extraction, (2) code generation, and (3) the stencil buffer library. Feature extraction discovers the user-defined configurations, the stencil types, and the underlying GPU platforms. Code generation automatically produces stencil codes for the different cached systems, i.e., L1 cache, scratchpad

memory, and registers. In essence, the codes focus on loading from and storing to the global

store, during which GPU-UNICACHE needs to deal with problems like indexing, synchronization, workload partitioning, and thread communication. Finally, the stencil buffer library wraps the generated codes inside a set of functions with uniform interfaces. We provide details of how our

GPU-UNICACHE library framework works below.



Figure 7.3: An overview of the GPU-UNICACHE framework.

At the first step, the inputs analyzer component conducts analysis of the user input parameters and some features extracted from the underlying GPU platforms. This information includes the stencil types (e.g., stencil order, stencil size), block configs (e.g., block and warp dimensions, blocking strategies), and GPU specifications (e.g., built-in warp size, ISAs for data exchange). These parameters assist the framework in realizing the generalized stencil patterns.

The code generation component uses three models, one for each cache level, for the given stencils. The L1-cache model mainly uses the hardware's capability to access contiguous data. The scratchpad memory model solves the problems of eliminating branches and of index conversion under different blocking strategies. The register model solves how the data are distributed into the registers of each thread and how the threads communicate with each other to obtain the desired neighbors. The generated codes serve three purposes: cache declaration, which allocates the required space for scratchpad or registers; cache initialization, which loads central and halo data from the global store; and cache fetch, which fetches the desired data using offsets away from the current point. The codes are finally wrapped into a set of functions by the stencil buffer library component. Developers can call the functions through unified interfaces to design dedicated stencil kernels for efficacy.

7.3.1 GPU-UNICACHE API

The GPU-UNICACHE library provides the operation functions for moving data between on-chip storage and off-chip DRAM memory for stencil computations. Algorithm 7.1 lists the cache classes and their core member functions. The GPU-UNICACHE API is object oriented. The base class defines the interface to initialize the cache, i.e., init(), and to access locations with given relative offsets, i.e., fetch(). Since all these member functions are executed on GPU devices, we use __device__ qualifiers for NVIDIA GPUs and [[hc]] attribute specifiers for AMD GCN3 GPUs. Internally, the classes use _load() and _store() to access locations in the cache using local indices. Sub-classes are devised for the different cache storages.

Algorithm 7.1 Interface of GPU-UNICACHE functions

1  template <typename T>
2  class GPU_UniCache
3  {
4    protected:
5      virtual T _load(int z, int y=0, int x=0)=0;
6      virtual void _store(T v, int z, int y=0, int x=0)=0;
7    public:
8      virtual void init(T *in, int off, int mode=CYCLIC)=0;
9      virtual T fetch(int z, int y, int x, int tc_i=0)=0;
10 };
11 // Derived classes
12 class L1Cache : public GPU_UniCache{ ... };
13 class LDSCache: public GPU_UniCache{ ... };
14 class RegCache: public GPU_UniCache{ ... };

In Table 7.2, we list the member functions and the corresponding descriptions. Note, all the member functions need the location information of the running thread, such as the global or local index. For NVIDIA GPUs, no specific arguments need to be passed to the functions, since CUDA supports built-in constants regarding the thread index. For AMD GCN3 GPUs, we need to explicitly pass such information of the tiled index by reference. For brevity, we don't list them in the table. In practice, developers create an inherited GPU-UNICACHE object (e.g., LDSCache) within a device kernel to declare an empty cache space. After the data have been stored into the cache, they can use the object to get data from the neighbors. We present a working example to show how the GPU-UNICACHE API works.

Table 7.2: GPU-UNICACHE and its subclass member functions

(Constructor)(int dz, int dy, int dx, int h, int tc): Constructs a specific cache, initializing its attributes of the stencil domain dimensions (dz, dy, dx), order (h), and thread coarsening factor (tc).
_load(int z, int y, int x): Loads data from cache using local indices (z, y, x). If only z is set, z is the register index.
_store(T v, int z, int y, int x): Stores data (v) to cache using local indices (z, y, x). If only z is set, z is the register index.
init(T *in, int off, int mode): Initializes the cache from source in. The target domain will be located using info. from the constructor or the user-defined offset off. The workload distribution can be switched by mode, which currently supports CYCLIC and BRANCH styles.
fetch(int z, int y, int x): Fetches data using the offsets (z, y, x) away from the central point.

7.3.2 GPU-UNICACHE Example

We use an example of a 2D5Pt GPU kernel (Equation 7.1), shown in Algorithm 7.2, to illustrate how to use the API. This stencil simply uses a 2D blocking optimization strategy and registers as the cache. In line 4, the kernel declares a RegCache with a thread coarsening factor of 4, which means each thread will perform 4 iterations of the stencil computation over 4 points. It has been demonstrated that thread coarsening is useful for stencils [25, 107], and we will discuss it in detail in § 7.4. Then, we fill the register cache by calling the init() member function. Here, we use the loading mode CYCLIC in line 5, which means the kernel will distribute all the domain data evenly to each thread in a round-robin fashion. While performing the actual stencil computation (lines 8 to 12), users only need to provide the relative offsets of the target neighbors, and fetch() will figure out where to get the data.

The GPU-UNICACHE APIs aim to facilitate the process of accessing cached data in stencils on Cartesian grids and allow GPU programmers to develop efficient kernel codes optimized by different blocking strategies. The codes can be easily changed to work at another cache level for more efficiency. We can also use multiple types of caches at the same time by declaring different GPU-UNICACHE objects. This could benefit programs that place significantly high resource pressure on a single type of cache. More importantly, the kernel codes are portable across different GPU platforms. We will cover how the GPU-UNICACHE framework assists in automatically generating the codes for these functions in § 7.4.

Algorithm 7.2 Example of a 2D5Pt stencil CUDA kernel using RegCache APIs with a thread coarsening factor csr_fct of 4, which means each thread will update 4 cell points. The current cell point is located by using csr_id.

1  __global__
2  void kern_2d5pt(float *in, float *out, float a0-4)
3  {
4    RegCache buf(m, n, h, 4);
5    buf.init(in, 0, CYCLIC);
6    // each thread processes 4 points since csr_fct = 4;
7    for (csr_id = 0; csr_id < 4; csr_id++)
8      out[/*global_idx w/ offset csr_id*/] = a0 * buf.fetch(-1, 0, csr_id)+
9                                             a1 * buf.fetch( 0,-1, csr_id)+
10                                            a2 * buf.fetch( 0, 0, csr_id)+
11                                            a3 * buf.fetch( 0, 1, csr_id)+
12                                            a4 * buf.fetch( 1, 0, csr_id);
13 }

7.4 Code Generation

In this section, we put emphasis on the register and scratchpad memory methods, since both meth- ods need to explicitly handle how to access the data. For the member functions of sub-classes in § 7.3, we generate the real codes based on our generalized code constructs and algorithms.

7.4.1 Input Parameters

The input parameters are used by the framework to understand the features of the target stencil and the running environment. Table 7.3 shows the list of required parameters in three categories. Among them, crs_fct and crs_dim are used specifically for thread coarsening in the RegCache methods. RegCache methods handle the computation based on the unit of a wave, whose thread number is usually much smaller than a thread block, meaning we need to load more halo data. In contrast, thread coarsening [107] is an optimization technique to increase the workload of each thread and enable loaded data to be reused more. Therefore, we use thread coarsening to compensate for the low data-reuse rate in the RegCache methods.

Table 7.3: List of input parameters for the code generation

User-defined thread layout
blk_dim[3]: Thread block dimensions in exponent notation (with a base of 2). The least significant dimension is blk_dim[0].
wav_dim[3]: Wave dimensions in exponent notation (with a base of 2). The least significant dimension is wav_dim[0].
crs_fct: Thread coarsening factor. It defines the number of iterative processes the wave will conduct.
crs_dim: The dimension along which thread coarsening occurs.
Stencil computation characteristics
h: Stencil order.
N: Stencil size.
sten_dim: Stencil dimensionality.
GPU architecture characteristics
blk_sync(): Built-in block-level synchronization barrier.
wav_size: Number of threads in a wave.
wav_shfl(v, id): Machine-dependent register-level data exchange instruction. Data exchange occurs between the calling thread and thread id on value v.

7.4.2 RegCache Methods

We first look at a specific example of the "2D9Pt" stencil and analyze its data distribution and computation patterns. Figure 7.4 shows a wave with a thread layout of 2 x 4 = 8 (i.e., wav_dim[1] = 1 and wav_dim[0] = 2) that loads the required grid points of 4 x 6 = 24 (i.e., h = 1). The 24 points are distributed evenly into the registers of each thread in a round-robin fashion, meaning 24/8 = 3 iterations and registers are needed. To achieve this CYCLIC loading method, we map the threads to their assigned points by using $(y, x) = (\lfloor (i \cdot wav\_size + tid)/(2^{wav\_dim[0]} + 2h) \rfloor,\; (i \cdot wav\_size + tid) \bmod (2^{wav\_dim[0]} + 2h))$, where tid is the thread index and i is the iteration number. Therefore, for example, thread 0 will deal with points (0,0), (1,2), (2,4) and store them in registers r, s, t, respectively.

The destination points (gray area) are updated by fetching registers from their neighbors. However, this raises two further questions: (1) which threads to communicate with; and (2) which registers store the desired neighbors. We observe from Figure 7.4 that this information can be calculated from the neighbors of thread 0 in the wave (located in the red circle). For example, when handling the northeast (NE) neighbors, we need to know that the first neighbor is stored in register 0 (r) of thread 2. Then, the other neighbor thread indices and register indices can be calculated by each thread applying $(tid + 2 + \lfloor tid/2^{wav\_dim[0]} \rfloor \cdot 2h) \bmod wav\_size$ and $0 + \lfloor (2 + \lfloor tid/2^{wav\_dim[0]} \rfloor \cdot 2h + tid)/wav\_size \rfloor$, respectively. That way, thread 0 will interact with thread 2 on register 0 (r), and simultaneously, thread 4 will fetch the value of register 1 (s) from thread 0.
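The following small host-side program, an illustration only, evaluates the CYCLIC mapping and the NE-neighbor formulas for this 2x4-wave example (wav_size = 8, wav_dim[0] = 2, h = 1); the constants and the program itself are assumptions used to reproduce the numbers in the text.

#include <cstdio>

int main() {
  const int wav_size = 8, row_w = (1 << 2) + 2 * 1;  // 2^wav_dim[0] + 2h = 6
  for (int tid = 0; tid < wav_size; ++tid) {
    // Points (y, x) owned by this thread for iterations (registers) i = 0..2.
    for (int i = 0; i < 3; ++i) {
      int idx = i * wav_size + tid;
      printf("tid %d, reg %d -> (%d,%d)\n", tid, i, idx / row_w, idx % row_w);
    }
    // NE neighbor: which thread and which register hold it.
    int friend_id = (tid + 2 + (tid / 4) * 2) % wav_size;
    int reg_id    = (2 + (tid / 4) * 2 + tid) / wav_size;
    printf("tid %d fetches register %d of thread %d for NE\n",
           tid, reg_id, friend_id);
  }
  return 0;
}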

Figure 7.4: Example of data exchange for the "2D9Pt" stencil. The figure illustrates 2 steps of loading and computing. For loading, all the data are distributed across threads in a 2x4 wave. The '2r' in a cell, for example, means the corresponding value will be stored in register r of thread 2. For computing, the wave updates the gray area. Thread 0 in the wave is in deep gray. The generated CUDA code to access the NE neighbors is:
friend_id = (lane_id+2+(lane_id>>2)*2)&7;
tx = _shfl(r, friend_id);
ty = _shfl(s, friend_id);
return lane_id < 4? tx: ty;

With the variety of stencils and options (e.g., thread coarsening factors and neighbor directions), manually calculating these parameters is a painful task. As the size and complexity of the target stencil grow, so does the development cost. Therefore, in our framework, we first generalize the stencil computation in registers by means of code constructs. Then, we calculate the parameters using our proposed formula and algorithm.

Method init(): we only support the CYCLIC loading method rather than BRANCH in RegCache. The reasons are two-fold: (1) BRANCH mode would make the boundary threads hold too many registers, and thus all the other threads in the same wave would have to keep the same number of "idle" registers, leading to register pressure; (2) while accessing neighbors, extra branches would be needed to distinguish the meaningful registers from the "idle" ones. The code constructs in Algorithm 7.3 show how we distribute the DRAM data to registers. The remapping occurs in lines 3 to 6, and the fetched data are stored to registers (line 8).

Algorithm 7.3 Code constructs for RegCache init() method

1  // **** CYCLIC ****
2  int it = _lane_id();
3  c_0 = (wav_id0 ...   // lines 3-8: compute the remapped (y, x) position of `it`
                        // and store the loaded value into the current register
9  it += wav_size;
10 // repeat for prod_{k=0..2}(2^wav_dim[k] + 2h)/wav_size times

Method fetch(): Algorithm 7.4 exhibits the generalized data exchange code constructs to fetch data in a given direction. The neighbor thread index is represented by friend_id, which depends on the parameter F. The registers of interest range from regN1 to regN3. Parameter M is the cut-off marker to select values from the different registers. Here, we only use up to three data exchange operations to fetch the data, since this number fits our benchmark of stencils and the different wave dimensions. For other stencils with a higher stencil order, for example, it is easy to extend the pattern to support more data exchange operations.

Algorithm 7.4 Code constructs for RegCache fetch() method

1  // **** Fetch a given neighbor ****
2  friend_id = (lane_id+F+
3    ((lane_id>>wav_dim[0])*2*h))&(wav_size-1);
4  tx = wav_shfl(regN1, friend_id);
5  ty = wav_shfl(regN2, friend_id);
6  tz = wav_shfl(regN3, friend_id);
7  return ((lane_id < M1)? tx: ((lane_id < M2)? ty: tz));

Algorithm 7.5 shows the pseudo code for calculating the parameters based on the given inputs from Table 7.3 (each direction of neighbors needs a set of the parameters). We define a domain as the set of points surrounding the first thread in a wave. Since threads might be coarsened by the factor crs_fct, there are multiple domains stored in dom (lines 1 and 17). In the function calculating parameter F (line 1), the identifier of the starting point in each domain is computed in line 8. Then, we sweep all the other points and record their relative order within the wave (line 9). The order is the parameter F, which can be used later by other threads in the wave to find neighbors in the same direction (line 2 in Figure 7.4). Additionally, we record the round number in line 10, indicating how many registers we have already used in each thread. Note, the out-of-domain points should be skipped in lines 12 to 14.

Subsequently, we need to calculate which registers are used to store the target neighbors in the wave and how to select data from these registers. This can be achieved by computing the parameters N and M through the function in line 17. The register identifier in line 24 indicates the register storing the first value of the neighbors in the given direction. Then, we can calculate the boundaries of the neighbors of the entire wave (lines 27 and 28, also shown in Figure 7.4), which will be used to skip other irrelevant points. If an incoming point is identified as using a new register in line 36 and it is within the boundary in line 38, the new register is recorded, with the counter cnt marking the cut-off location.

Algorithm 7.5 Algorithms to calculate the F, N, and M.

 1 void calculate_F(domain* dom)
 2 { // compute param F used in the friend_id formula
 3   for(int c = 0; c < crs_fct; c++)
 4   {
 5     for(auto pt: dom[c]) // each point in the domain
 6     {
 7       if(pt == dom[c].begin())
 8         id = c * prod_{k=0..crs_dim-1} (2^wav_dim[k] + 2h);
 9       pt.F   = id % wav_size;
10       pt.rid = id / wav_size;
11       id++;
12       if(pt == dom[c].end_x())  // skip trailing points
13         id += 2^wav_dim[0];
14       if(pt == dom[c].end_yx()) // skip trailing lines
15         id += 2^wav_dim[1] * (2^wav_dim[0] + 2h);
16 }}}
17 void calculate_NM(domain* dom)
18 { // compute params N, M used in the data exchange patterns
19   for(int c = 0; c < crs_fct; c++)
20   {
21     for(auto pt: dom[c]) // each point in the domain
22     {
23       int i = 1, j = 1;
24       int reg_id = pt.rid;
25       pt.N[i++] = reg_id;
26       int skipped_pts = pt.F + reg_id * wav_size;
27       int col_lb = skipped_pts % (2^wav_dim[0] + 2h);
28       int col_rb = col_lb + 2^wav_dim[0];
29       int cnt = 1;
30       bool reg_update = false;
31       while(cnt < wav_size)
32       {
33         skipped_pts++;
34         int col_id = skipped_pts % (2^wav_dim[0] + 2h);
35         int wav_id = skipped_pts % wav_size;
36         if(wav_id == 0) // end of the current wave
37           reg_id++, reg_update = true;
38         if(col_lb <= col_id && col_id < col_rb)
39         {
40           if(reg_update) // mark the divergence
41           {
42             pt.N[i++] = reg_id;
43             pt.M[j++] = cnt;
44             reg_update = false;
45           }
46           cnt++; // advance to the next lane in the wave
47 }}}}}}

After we calculate these parameters, we replace wav_shfl() with __shfl() for NVIDIA GPUs and __amdgcn_ds_bpermute() for AMD GCN3 GPUs. Note that for AMD GCN3 GPUs, we need to right-shift the friend_id by 2 (§ 2.1.2). If the datatype is double precision, we first split the value into two 32-bit halves, perform two data exchange instructions, and then concatenate the results.
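A minimal sketch of this 64-bit case on NVIDIA GPUs, assuming CUDA 7.5-era intrinsics (the wrapper name wav_shfl_double is illustrative):

// Hedged sketch: a 64-bit data exchange built from two 32-bit shuffles, as
// described above; the value is split, shuffled, and concatenated again.
__device__ double wav_shfl_double(double v, int friend_id)
{
    int lo = __double2loint(v);       // lower 32 bits
    int hi = __double2hiint(v);       // upper 32 bits
    lo = __shfl(lo, friend_id);       // first 32-bit data exchange
    hi = __shfl(hi, friend_id);       // second 32-bit data exchange
    return __hiloint2double(hi, lo);  // concatenate the two halves
}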

7.4.3 LDSCache Methods

Method init(): The major problem encountered when using scratchpad memory is conditional branching, since the sizes of the thread block and the working data domain do not match each other. In LDSCache, we support two loading modes: BRANCH, in which boundary threads handle more workload (i.e., the halo points), and CYCLIC, in which threads address the data domain in a round-robin fashion by remapping themselves. This way, we can minimize the branches at the expense of more index-conversion calculations. The code constructs of BRANCH are comprised of multiple conditional statements that assign the additional workload to boundary threads. The CYCLIC code constructs are similar to those of the RegCache method (Algorithm 7.3), but with the granularity of blk_size rather than wav_size; in addition, the destination locations are changed to scratchpad memory. Note that we need an explicit synchronization blk_sync() at the end of loading.

Method fetch(): This method is straightforward: we only need to use the thread's local index to fetch the desired data, since the loaded points follow the original data layout and are the same whether BRANCH or CYCLIC mode is used.
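A minimal CUDA sketch of the CYCLIC loading mode of LDSCache is shown below; the identifiers in, in_pitch, tile_x0, tile_y0, and lds are assumptions for illustration, not the generated code.

// Hypothetical sketch of LDSCache CYCLIC loading: the thread block
// round-robins over the halo-extended tile, stages it in scratchpad
// (shared) memory, and then synchronizes.
__device__ void lds_cache_init_cyclic(const float *in, int in_pitch,
                                      int tile_x0, int tile_y0,
                                      int tile_w, int tile_h,
                                      float *lds /* shared-memory buffer */)
{
    const int tid = threadIdx.y * blockDim.x + threadIdx.x;
    const int blk_size = blockDim.x * blockDim.y;
    for (int it = tid; it < tile_w * tile_h; it += blk_size) {
        int c0 = it % tile_w;   // column inside the tile
        int c1 = it / tile_w;   // row inside the tile
        lds[c1 * tile_w + c0] = in[(tile_y0 + c1) * in_pitch + (tile_x0 + c0)];
    }
    __syncthreads();            // the blk_sync() mentioned above
}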

7.5 Evaluation

7.5.1 Experiment Setup

In this section, we evaluate the stencil codes using the GPU-UNICACHE library. The details of the two platforms are listed in Table 7.4. We conduct the tests using both single precision and double precision numbers.

The benchmark stencils are listed in § 7.2.1. We optimize them using the GPU-UNICACHE APIs with different blocking strategies. The blocking strategies used in the 1D and 2D stencils are straightforward, while in the 3D kernels we use 2.5D and 3D blocking [117] (labeled as 1DBlk, 2DBlk, 2.5DBlk, and 3DBlk).

Table 7.4: Experiment Testbeds

                        AMD                  NVIDIA
Model                   Radeon R9 Nano       GeForce GTX 980
Codename                Fiji XT              GM204 (Maxwell)
Cores                   4096                 2048
Core frequency          1000 MHz             1126 MHz
Register file size      256 kB*              256 kB
L1/LDS/L2               16/64/1024 kB        -/96/2048 kB
Memory bus              HBM                  GDDR5
Memory capacity         4096 MB              4096 MB
Memory BW               512 GB/s             224 GB/s
GFLOPS float/double     8192/512             4612/144
Software                HCC/ROCM 1.2         CUDA 7.5
* Each CU has 256 kB of vector registers and an additional 8 kB of scalar registers.

For the LDSCache version, we try both loading modes, BRANCH and CYCLIC (denoted BRC and CYC), while for RegCache we vary the dimensionality of the wave, 1D and 2D (denoted 1DWav and 2DWav). The 1DWav is 64×1 for AMD and 32×1 for NVIDIA, while the 2DWav is 8×8 and 8×4 on the two platforms, respectively. The data set sizes are 2^25, 2^12 × 2^12, and 2^8 × 2^8 × 2^8 for the 1D, 2D, and 3D stencils, respectively. Each test iterates 100 times. The metric we use is GFLOPS, calculated as (FLOP · dim2 · dim1 · dim0)/time, where FLOP is the number of floating-point operations per stencil point.
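As a simple illustration of this metric, a host-side helper mirroring the formula could look as follows; flop_per_point and elapsed_s are assumed names, not part of the framework.

// Hypothetical host-side helper mirroring the GFLOPS formula above.
static double gflops(double flop_per_point, long long dim2, long long dim1,
                     long long dim0, double elapsed_s)
{
    double total_flop = flop_per_point * dim2 * dim1 * dim0;
    return total_flop / elapsed_s / 1e9;
}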

7.5.2 AMD GCN3 GPU

We report the best achievable speedup when a kernel is optimized by RegCache or LDSCache, if not mentioned otherwise. For 1D stencils, the performance numbers are shown in Figure 7.5. Different cache levels show very similar performance: using LDSCache or RegCache does not bring significant improvements over L1Cache, because a 1D stencil has a unit-stride memory access pattern whose data can be effectively cached by the hardware. The best case with RegCache yields a 15% improvement; LDSCache achieves up to a 13% improvement. We also notice that performance deteriorates with LDSCache in BRANCH mode for the gaussianX7 stencil, due to the extra loading operations that transfer data from the L1 cache to scratchpad memory and the overhead of branches, which can be offset by using CYCLIC mode. For 2D stencils, the L1Cache methods still exhibit competitive performance on the AMD GCN3 GPU (shown in Figures 7.6a and 7.6b). The maximum speedups with LDSCache and RegCache surpass L1Cache when the data reuse grows, as in the seidel-2d stencil.

Figure 7.5: 1D stencils with HCC by GPU-UNICACHE on AMD GPU. (a) jacobi-1d, left: single, right: double; (b) gaussianX7, left: single, right: double.

In the LDSCache solution, we first observe that 2D stencils are more sensitive to the loading mode: CYCLIC mode is generally superior to BRANCH mode, averaging 25% better performance, since more branches are needed to load the surrounding data in 2D stencils. The maximum improvement of LDSCache over L1Cache is 9%. In the RegCache solution, the 1D wave variant is particularly effective compared with the 2D wave. A 1D wave has a longer dimension when conducting memory accesses, which better utilizes the hardware bandwidth but at the expense of relatively low data reuse; a 2D wave, by contrast, exhibits a high data reuse rate. For example, considering the wave size of 64 on AMD GPUs, if we organize the wave threads as 64 by 1, we have to load 66*3=198 elements for a 2D problem with a stencil order of 1; if we organize them as 8 by 8, we only need to load 9*9=81 elements. On the other hand, the former thread layout can load the data in fewer memory transactions, leading to its superior performance. If we consider the effect of thread coarsening on performance, the 2D wave can barely benefit from it, because the narrowed access stride makes it bound by memory latency. The best speedup we record for RegCache is 15% over L1Cache.

Figure 7.7 shows more significant and diversified speedups with RegCache and LDSCache. Differences in performance are first reflected in the speedups of the 2.5D and 3D blocking: 2.5D blocking gives a speedup of 1.58x over 3D blocking on average. This is because 3D blocking has smaller block dimensions than 2.5D blocking if we assume the blocks have the same number of threads; as a result, uncoalesced memory accesses occur even though 3D blocking has a better data reuse rate. In the low data-reuse kernel (heat-3d), the L1Cache solution performs similarly to LDSCache, while the additional gain comes from using RegCache, resulting in up to a 30% improvement. This is mainly due to RegCache eliminating the explicit synchronization in this iterative 2D method.

Figure 7.6: 2D stencils with HCC by GPU-UNICACHE on AMD GPU. (a) jacobi-2d, left: single, right: double; (b) seidel-2d, left: single, right: double.

Figure 7.7: 3D stencils with HCC by GPU-UNICACHE on AMD GPU. (a) heat-3d, left: single, right: double; (b) jacobi-3d, left: single, right: double.
For high data-reuse kernels (jacobi-3d), LDSCache or RegCache is critical to obtaining optimal performance. The best improvements are 1.70x for LDSCache and 1.81x for RegCache over their L1Cache counterparts. Moreover, we prefer CYCLIC mode in LDSCache for high data-reuse kernels, observing that the overhead of branches is significantly higher because, for example, the jacobi-3d stencil has nearly 4 times more halo elements to load than the heat-3d stencil. For the RegCache solutions, we only record the performance of the 2D wave in 3D blocking, because using a 1D wave instead would be equivalent to 2.5D blocking with a 1D wave; also, we only show the results of the 1D wave for 2.5D blocking, since this strategy is preferable, as demonstrated above. Similarly, 3D blocking encounters higher memory latency, so it benefits little from thread coarsening. By contrast, 2.5D blocking with a 1D wave improves significantly after applying thread coarsening. The speedups given by thread coarsening in the double precision cases are less consistent, where the optimal thread coarsening factors are only 1 or 2, since operating on doubles requires more space in the register file and register pressure is reached more easily. Furthermore, because built-in data exchange of 64-bit data is not supported, we need more operations to achieve the same functionality, i.e., splitting the data into two 32-bit halves, performing two permutes, and concatenating them.

7.5.3 NVIDIA Maxwell GPU

Figures 7.8, 7.9b, and 7.9c show the performance of the 1D and 2D stencils on the NVIDIA Maxwell GPU. For low data-reuse kernels, the optimum is achieved by simply using L1Cache. This demonstrates the need to "opt in" to enable global caching on the Maxwell GPU (§ 2.1.2), which is particularly effective for 1D and 2D arrays. The benefits of using LDSCache or RegCache become obvious when there is high data reuse, achieving up to 5% and 20% improvements for the gaussianX7 and seidel-2d stencils, respectively. However, for the double datatype, we observe a slowdown with LDSCache and RegCache. For LDSCache, since the shared memory banking in Maxwell only supports a 4-byte width per bank, the overhead of accessing 8-byte data is accordingly higher; for RegCache, more instructions are needed for each data exchange operation on 8-byte data. Similar to the 2D stencils on the GCN3 GPU, CYCLIC mode is necessary for the seidel-2d kernel, and a 1D wave is preferable because all the threads in the same wave access consecutive locations, achieving coalesced memory transactions. The performance of the 3D stencils, shown in Figure 7.10, exhibits diversified speedups after applying different cache levels.

Figure 7.8: 1D stencils with CUDA by GPU-UNICACHE on NVIDIA GPU. (a) jacobi-1d, left: single, right: double; (b) gaussianX7, left: single, right: double.

Figure 7.9: 2D stencils with CUDA by GPU-UNICACHE on NVIDIA GPU. (b) jacobi-2d, left: single, right: double; (c) seidel-2d, left: single, right: double.

Speedups of using LDSCache or RegCache range from a few percent on the low data-reuse kernels up to 1.64x and 1.83x on the high data-reuse kernels for LDSCache and RegCache, respectively. The 2.5D blocking is still preferred for the 3D stencils, and for the float datatype we record 4% to 12% improvements of the best RegCache over LDSCache. 2.5D blocking needs to iteratively load a 2D slice before conducting the actual computation, which incurs the overhead of block-level synchronization; in contrast, RegCache eliminates this explicit synchronization, leading to better performance. For the double datatype, using our L1Cache interface can provide competitive performance, mainly because the overhead of operating on doubles in RegCache and LDSCache is relatively high on Maxwell.

Figure 7.10: 3D stencils with CUDA by GPU-UNICACHE on NVIDIA GPU. (a) heat-3d, left: single, right: double; (b) jacobi-3d, left: single, right: double.

7.5.4 Speedups to Existing Benchmarks

In this section, we optimize third-party benchmarks by using GPU-UNICACHE. They have been optimized via different spatial blocking strategies: 2DConv and 3DConv (PolyBench [126]) use 2D and 3D blocking with L1 cache, respectively; stencil (Parboil [141]) is a "3D7Pt" stencil optimized by 2.5D blocking with shared memory; and stencil2d (SHOC [54]) adopts 2D blocking with shared memory. We optimize these kernels by using GPU-UNICACHE and only report the best performance. Figure 7.11 presents the results of the comparisons on the NVIDIA GPU (there are no equivalent benchmarks using HCC yet). For the single datatype, all the optimal GPU-UNICACHE codes use RegCache and can outperform the baselines by up to 1.5x. For the double datatype, GPU-UNICACHE selects L1Cache for 2D stencils and LDSCache for 3D stencils, mainly because the overhead of register shuffles on doubles grows. The best improvement is as high as a 1.3x speedup.

Figure 7.11: GPU-UNICACHE optimized codes vs. existing stencil benchmarks optimized by spatial blocking on the NVIDIA Maxwell GPU. (a) single datatype; (b) double datatype.

7.5.5 Discussion

Running Parameters In the experiments, we use the same settings for the kernels to evaluate the performance, for two main reasons. First, we can limit the variables to the choice of cache level and focus on the correlation between performance and the different GPU-UNICACHE functions. One exception is that we need to shrink the total number of threads as the thread coarsening factor grows in RegCache kernels. Second, the GPU-UNICACHE APIs are also designed to give GPU programmers access to the different caches simultaneously, especially when a program encounters high pressure on a single type of resource; therefore, we need to test the APIs under the same circumstances. Despite this, we still observe diversified speedups, indicating that an auto-tuning framework is necessary [106, 72]. We leave this as future work.

Register Pressure Using too many registers in GPU programs can reduce the number of active waves per CU. Table 7.5 shows the profiled register usage of the jacobi-3d stencil, which exhibits the highest data reuse rate. First, as the coarsening factor doubles, the number of registers increases logarithmically, because the coarsening technique improves data reuse. Additionally, 2.5D blocking generally uses more registers than 3D blocking, as discussed in § 7.5.2. The 2.5D blocking kernel with the best performance attains 40% occupancy on both GPU platforms. However, considering its better memory access and high FLOP rate, the active waves can still effectively utilize the computing resources.

Table 7.5: Register usage of jacobi-3d stencil*

             L1Cache           LDSCache                            RegCache
GPU       3DBlk  2.5DBlk   3DBlk  3DBlk  2.5DBlk  2.5DBlk   3DBlk 2DWav       2.5DBlk 1DWav
                           BRC    CYC    BRC      CYC       TC1  TC2  TC4     TC1  TC2  TC4
AMD         37     101      21     18     40       31        19   31   48      48   59   74
NVIDIA      32      32      30     28     31       31        32   40   56      42   56   80
* Collected with CodeXL 2.2 for the AMD GPU and the nvprof tool for the NVIDIA GPU

7.6 Chapter Summary

In this chapter, we propose the GPU-UNICACHE framework to automatically generate library codes to access cached data in L1, scratchpad memory, and registers for the spatial blocking optimizations of stencil computations. The codes to achieve these functionalities are automatically generated by GPU-UNICACHE based on the information of the stencils and the underlying architectures. GPU-UNICACHE facilitates efficient access to cache-loaded data without a tedious code rewrite, a major advantage when designing different stencil codebases. The evaluation demonstrates that we can obtain up to 1.8x improvements by only changing the GPU-UNICACHE API calls on different GPU platforms.

Chapter 8

Conclusion and Future Work

The key goal of this dissertation is to develop and propose innovative solutions to performance portability issues in parallel computing. In this dissertation, we have successfully applied a methodology built on parallel patterns to solve critical issues in accelerating scientific kernels. This methodology leads to a series of novel and effective automation frameworks that take into consideration the characteristics of both the algorithm and the hardware. For example, our proposed ASPaS framework [75] features a fast search for the optimal instruction combinations based on the given data-reordering patterns and ISAs. This dissertation is motivated by the increasing complexity and cost encountered by developers to manage efficient and portable parallel codes for various parallel systems. We approach the problem from two aspects in discussing our use of parallel patterns: (1) domain-specific languages to tackle vector data reordering on x86-based platforms and data-thread binding in GPU environments [75, 78, 73]; and (2) algorithmic skeletons to deal with SIMD operations, data dependencies, and data reuse problems in different computational kernels [76, 79, 77]. This work is realized by performing extensive analysis to uncover and identify the performance/programmability issues, formalizing patterns from the target algorithms, and building corresponding frameworks to generate efficient codes across platforms. We demonstrate, in this dissertation, improved performance efficiency on various modern accelerators without requiring developers to program at the low level.


8.1 Summary

The growing demand for extreme performance from modern accelerators is driving the use of programming models defined by each vendor (e.g., CUDA, HCC) with low-level codes (e.g., compiler intrinsics). This inevitably increases the complexity and cost of application development and, even worse, results in performance portability issues, meaning that efficient codes for one platform may not be portable to another. Along this line, this dissertation handles the above issues by introducing an additional abstraction layer built on parallel patterns to exploit the characteristics of both algorithms and hardware, to generate efficient codes across platforms, and to shield developers from the tricky details of programming accelerators.

We first focus on the research problem of data reordering in parallel sort, a commonly used kernel in many scientific applications. We present an automation framework, ASPaS [75, 78], to generate highly efficient vector codes on x86-based processors, i.e., CPUs and MICs. In the framework, we formalize the base cases (e.g., sorting networks) as sequences of comparing and reordering operators using a DSL. ASPaS performs a fast search for the vector instructions by exploiting the symmetric features of the data-reordering patterns and by drawing from our proposed pools of ISA-friendly operation primitives. The ASPaS-generated codes can outperform compiler-optimized codes and deliver substantial performance gains over existing sorting tools from libraries, e.g., STL, BOOST, and TBB, on the Ivy Bridge, Haswell, and Knights Corner architectures.

Second, we target segmented sort and the associated data-thread binding issues on GPUs. The demand for processing a large number of independent workloads is on the rise in the "big data" era; segmented sort is one example. The existing parallel solutions on GPUs often fail to take full advantage of the computing resources. We first uncover the performance issue of processing skewed data, a common input scenario in real-world applications. Then, we present a novel and efficient segmented sort [73] that can adaptively adjust its solutions to process different segments. In this mechanism, we generalize the register-based sort algorithm as an N-to-M data-thread binding and communication problem: how to assign the target data to threads and how the data reordering changes correspondingly. We formalize the binding patterns and look for the best realization to perform the sorting networks on multiple memory hierarchies of registers, shared memory, etc.

We show the performance improvements over previous tools from CUB, ModernGPU, and CUSP. The performance advantage is also demonstrated by integrating our segmented sort into real-world applications in bioinformatics and linear algebra.

Third, we tackle the performance portability problem caused by vectorizing sequence alignment algorithms. The existing vectorization strategies, "striped-iterate" and "striped-scan", all require explicit programming with low-level and platform-specific vector intrinsics. Thus, we present the AAlign framework [76], where the SIMD operations are generalized as a series of vector modules to hide the underlying platform-specific vector implementations. In addition, we reveal that the existing vectorization strategies can only deliver optimal performance on certain inputs. Based on these findings, AAlign supports a new input-agnostic hybrid method to achieve optimal performance regardless of the input scenario. AAlign accepts sequential codes following our generalized patterns and transforms them into efficient parallel codes. The results show that our hybrid method can yield superior performance over other sequence alignment tools, such as SWPS3 and SWAPHI.

This dissertation then looks at a more general computation, wavefront loops, extended from the sequence alignment algorithms. For the parallel strategies, we claim that using a compensation-based computational order is preferable over the traditional anti-diagonal-major ones on GPUs. The compensation-based method originally comes from parallelizing alignment algorithms in bioinformatics. Therefore, we first need to answer the research question: under which circumstances can the compensation-based method be used to optimize general wavefront loops? In this work [79], we uncover the boundary of this method by resorting to the relations of our introduced distributive and accumulative operators. Based on the findings, we present a highly efficient "weighted" scan as the skeleton to parallelize different wavefront problems on GPUs. Experiments demonstrate that our four optimized wavefront loops can achieve high performance not only for various input workloads but also across generations of GPUs.

Finally, this dissertation tackles the data reuse problem in stencil computations. Modern accelerators feature multiple levels of memory hierarchy (including L1 cache and local memory), meaning efficient data reuse strategies should be carefully designed to take full advantage of them. The GPU-UNICACHE framework [77] is therefore proposed to automatically generate library codes to access cached data in L1, local memory, and registers in GPU settings. The code generation is based on the information of the given stencils and the underlying architectures. These library codes have uniform APIs to lift the burden on developers of designing separate stencil codes for the different memory hierarchies used as cache. With GPU-UNICACHE, performance improvements can be achieved by only substituting the data access codes with GPU-UNICACHE API calls.

8.2 Future Directions

This dissertation focuses on research into parallel patterns: how parallel patterns bridge the gap between practical applications and modern accelerators. The research questions for future work can be stated in terms of two aspects: (1) Platforms: how can we efficiently exploit other emerging accelerators, heterogeneous platforms, clusters, and clouds? (2) Applications: how can we extend this methodology to other popular applications (e.g., computer security, deep learning, data mining)? In the following, we discuss potential future directions from these two aspects.

8.2.1 Utilizing Heterogeneous Accelerators and Clusters

Heterogeneous platforms are increasingly adopted in state-of-the-art supercomputers [5]. However, developers may encounter two major scheduling problems, since their jobs usually need to share computing resources or migrate among computing units (e.g., job consolidation [181] and job pipelining on GPU+CPU [51, 52]). It is known that an efficient solution requires a careful design of scheduling algorithms for the given parallel platforms and applications, which paradoxically hinders the usage of supercomputers or heterogeneous platforms. In the short term, we plan to (1) study and understand the scheduling patterns in applications, and (2) abstract these patterns to automate the code generation of scheduling that aims to exploit the different underlying resources.

As enterprises are increasingly moving their infrastructure services to the cloud with the goal of reducing administration cost, advanced techniques should be used to enhance the quality-of-service experience of data-intensive applications [151, 44, 43, 42, 17, 45, 13, 14, 15, 16, 191] and to reduce the cost for both cloud service providers and tenants [46, 47]. Unfortunately, such techniques also require significant efforts from developers, forming a huge gap between applications and the underlying distributed clusters; thus, we plan to apply our methodology to generalize the essential operations as an intermediate layer, which can enable an application-managed hardware design [42].

It is also promising to build a software/hardware infrastructure to measure the function-level performance and power consumption of HPC applications [93, 92, 95]. It enables the measurement of the power consumption of system components such as the CPU, memory, and PCI-E based accelerators (e.g., Intel Xeon Phi, GPUs). To reduce the total energy consumption of data centers, a decentralized replica selection mechanism can be used to allocate data-intensive workloads across geographically distributed data centers [96, 97]. To improve the energy efficiency of supercomputers, different throttling techniques [94, 32, 31, 159] and their impact on performance and power consumption should be investigated.

8.2.2 Accelerating Big Data Applications for Parallel Platforms

We are particularly interested in designing solutions for big data applications in security, data mining, and pattern matching. Mining security risks is becoming more challenging as the amount of data grows rapidly. New techniques [100, 98, 28, 40, 137, 99] are needed to detect, prioritize, and measure security risks in massive datasets. Within these designs, many search/sort/alignment algorithms are used to compose the tools for detecting security risks, which can directly benefit from our optimized kernels to exploit parallel platforms.

Another area we are interested in is accelerating the computation patterns in data mining. These patterns oftentimes involve irregular computation and memory access, which is challenging for parallel computing. For example, many tree-based data structures are used in the Data-Aware Vaccination problem to find healthy people based on data from social media (e.g., Twitter) [189, 190]. Another example is CT (computed tomography) image reconstruction in medical imaging diagnostics, which needs to operate over large-scale sparse matrices [173, 174]. Applying our methods to optimize such irregular data processing is challenging, since its patterns are usually unpredictable and dynamic. One potential solution is to find the patterns by preprocessing the data, which could benefit from our methods.

Pattern matching is another kernel used to filter information from large-scale data. It is challenging to parallelize it onto parallel architectures, e.g., GPUs and the emerging AP (Automata Processor) [119], which involves many tricky designs, such as fine-grained data parallelism, speculative computation, and mastering low-level instructions [169, 172, 168, 167, 166]. We believe many core computations can be formalized by using parallel patterns. In fact, Robotomata [171, 170] is a recent work on abstracting pattern matching for APs, which is able to provide not only superior performance but also user-accessible programmability.

Bibliography

[1] BLAST. http://blast.ncbi.nlm.nih.gov/Blast.cgi.

[2] Clang 6: Clang Language Extensions. https://clang.llvm.org/docs/LanguageExtensions.html.

[3] Clang: a C Language Family Frontend for LLVM. http://clang.llvm.org/.

[4] NCBI-protein. http://blast.ncbi.nlm.nih.gov/protein.

[5] TOP500 Supercomputing List. https://www.top500.org/.

[6] UniProt: Universal Protein Resource. http://www.uniprot.org/.

[7] L. A. Adamic and B. A. Huberman. Zipf’s Law and the Internet. Glottometrics 3, 2002.

[8] S. W. Al-Haj Baddar and K. W. Batcher. Designing Sorting Networks: A New Paradigm. Springer, 2011.

[9] AMD. Graphics Core Next Architecture, Generation 3 Reference Guide, 2016. Rev.1.1.

[10] S. E. Amiri, L. Chen, and B. A. Prakash. Segmenting Sequences of Node-Labeled Graphs. In IEEE Int. Conf. on Data Mining Workshops (ICDMW), 2016.

[11] S. E. Amiri, L. Chen, and B. A. Prakash. SnapNETS: Automatic Segmentation of Network Sequences with Node Labels. In AAAI Conf. on Artif. Intell., 2017.

[12] S. E. Amiri, L. Chen, and B. A. Prakash. Automatic Segmentation of Dynamic Network Sequences with Node Labels. IEEE Trans. on Knowledge Data Engineer., 2018.


[13] A. Anwar, Y. Cheng, and A. R. Butt. Towards Managing Variability in the Cloud. In IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), 2016.

[14] A. Anwar, Y. Cheng, A. Gupta, and A. R. Butt. Taming the Cloud Object Storage with MOS. In ACM Parallel Data Storage Workshop (PDSW), 2015.

[15] A. Anwar, Y. Cheng, A. Gupta, and A. R. Butt. MOS: Workload-aware Elasticity for Cloud Object Stores. In ACM Int. Symp. High-Perform. Parallel Distrib. Comput. (HPDC), 2016.

[16] A. Anwar, Y. Cheng, H. Huang, and A. R. Butt. ClusterOn: Building Highly Configurable and Reusable Clustered Data Services Using Simple Data Nodes. In USENIX Workshop Hot Top. Storage File Syst. (HotStorage), 2016.

[17] A. Anwar, M. Mohamed, V. Tarasov, M. Littley, L. Rupprecht, Y. Cheng, N. Zhao, D. Skourtis, A. S. Warke, H. Ludwig, D. Hildebrand, and A. R. Butt. Improving Docker Registry Design based on Production Workload Analysis. In USENIX Conf. File Storage Technol. (FAST), 2018.

[18] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A View of the Parallel Computing Landscape. Commun. ACM, 2009.

[19] R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy, S. Verdoolaege, A. Betts, A. F. Donaldson, J. Ketema, J. Absar, S. v. Haastregt, A. Kravets, A. Lokhmotov, R. David, and E. Hajiyev. PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming. In IEEE Int. Conf. Parallel Archit. Compil. Tech. (PACT), 2015.

[20] M. M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDA Code Generation for Affine Programs. In Int. Conf. Compiler Constr. (CC), Springer, 2010.

[21] K. E. Batcher. Sorting Networks and Their Applications. In AFIPS Spring Joint Comput. Conf., 1968.

[22] S. Baxter. ModernGPU 2.0: A Productivity Library for General-purpose Computing on GPUs. https://github.com/moderngpu/moderngpu.

[23] N. Bell, S. Dalton, and L. N. Olson. Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods. SIAM J. Sci. Comput., 2012.

[24] M. E. Belviranli, P. Deng, L. N. Bhuyan, R. Gupta, and Q. Zhu. PeerWave: Exploiting Wavefront Parallelism on GPUs with Peer-SM Synchronization. In ACM Int. Conf. on Supercomput. (ICS), 2015.

[25] E. Ben-Sasson, M. Hamilis, M. Silberstein, and E. Tromer. Fast Multiplication in Binary Fields on GPUs via Register Cache. In ACM Int. Conf. Supercomput. (ICS), 2016.

[26] U. Bondhugula, V. Bandishti, A. Cohen, G. Potron, and N. Vasilache. Tiling and Optimizing Time-iterated Computations on Periodic Domains. In IEEE Int. Conf. Parallel Archit. Compil. Tech. (PACT), 2014.

[27] R. C. Bose and R. J. Nelson. A Sorting Problem. J. ACM, 1962.

[28] A. Bosu, F. Liu, D. D. Yao, and G. Wang. Collusive Data Leak and More: Large-scale Threat Analysis of Inter-app Communications. In ACM Asia Conf. Comput. Commun. Secur. (Asia CCS), 2017.

[29] B. Bramas. Fast Sorting Algorithms using AVX-512 on Intel Knights Landing. CoRR, 2017.

[30] T. M. Chan. More Algorithms for All-pairs Shortest Paths in Weighted Graphs. In ACM Symp. Theory Comput. (STOC), 2007.

[31] H. C. Chang, B. Li, G. Back, A. R. Butt, and K. W. Cameron. LUC: Limiting the Unintended Consequences of Power Scaling on Parallel Transaction-Oriented Workloads. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2015.

[32] H. C. Chang, B. Li, M. Grove, and K. W. Cameron. How Processor Speedups Can Slow Down I/O Performance. In IEEE Int. Symp. Model. Anal. Simul. Comput. Telecommun. Syst., 2014.

[33] S. Che, J. W. Sheaffer, and K. Skadron. Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems. In ACM/IEEE Int. Conf. High Perf. Comput., Netw., Storage and Anal. (SC), 2011.

[34] L. Chen, S. E. Amiri, and B. A. Prakash. Automatic Segmentation of Data Sequences. In AAAI, 2018.

[35] L. Chen, K. S. M. T. Hossain, P. Butler, N. Ramakrishnan, and B. A. Prakash. Flu Gone Viral: Syndromic Surveillance of Flu on Twitter Using Temporal Topic Models. In IEEE Int. Conf. on Data Mining ICDM, 2014.

[36] L. Chen, K. S. M. T. Hossain, P. Butler, N. Ramakrishnan, and B. A. Prakash. Syndromic Surveillance of Flu on Twitter using Weakly Supervised Temporal Topic Models. Data Min. Knowl. Discov., 2016.

[37] L. Chen, N. Muralidhar, S. Chinthavali, N. Ramakrishnan, and B. A. Prakash. Segmentations with Explanations for Outage Analysis. Computer Science Technical Reports TR-18-02, VTechWorks, 2018.

[38] L. Chen and B. A. Prakash. Modeling Influence using Weak Supervision: A Joint Link and Post-level Analysis. Computer Science Technical Reports TR-18-03, VTechWorks, 2018.

[39] L. Chen, X. Xu, S. Lee, S. Duan, A. G. Tarditi, S. Chinthavali, and B. A. Prakash. HotSpots: Failure Cascades on Heterogeneous Critical Infrastructure Networks. In CIKM, 2017.

[40] L. Cheng, F. Liu, and D. D. Yao. Enterprise Data Breach: Causes, Challenges, Prevention, and Future Directions. Wiley Interdiscip. Rev.: Data Min. Knowl. Discovery, 2017.

[41] L. Cheng, K. Tian, and D. D. Yao. Orpheus: Enforcing Cyber-physical Execution Semantics to Defend against Data-oriented Attacks. In ACM Computer Secur. App. Conf., 2017.

[42] Y. Cheng, F. Douglis, P. Shilane, G. Wallace, P. Desnoyers, and K. Li. Erasing Belady’s Limitations: In Search of Flash Cache Offline Optimality. In USENIX Annu. Tech. Conf. (ATC), 2016.

[43] Y. Cheng, A. Gupta, and A. R. Butt. An In-memory Object Caching Framework with Adaptive Load Balancing. In ACM Eur. Conf. Comput. Syst. (EuroSys), 2015.

[44] Y. Cheng, A. Gupta, A. Povzner, and A. R. Butt. High Performance In-memory Caching Through Flexible Fine-grained Services. In ACM Ann. Symp. Cloud Comput. (SOCC), 2013.

[45] Y. Cheng, M. S. Iqbal, A. Gupta, and A. R. Butt. CAST: Tiering Storage for Data Analytics in the Cloud. In ACM Int. Symp. High-Perform. Parallel Distrib. Comput. (HPDC), 2015.

[46] Y. Cheng, M. S. Iqbal, A. Gupta, and A. R. Butt. Pricing Games for Hybrid Object Stores in the Cloud: Provider vs. Tenant. In USENIX Workshop Hot Top. Cloud Comput. (HotCloud), 2015.

[47] Y. Cheng, M. S. Iqbal, A. Gupta, and A. R. Butt. Provider versus Tenant Pricing Games for Hybrid Object Stores in the Cloud. IEEE Internet Comput., 2016.

[48] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient Implementation of Sorting on Multi-core SIMD CPU Architecture. Proc. VLDB Endow. (PVLDB), 2008.

[49] M. Codish, L. Cruz-Filipe, M. Nebel, and P. Schneider-Kamp. Applying Sorting Networks to Synthesize Optimized Sorting Libraries. In Int. Symp. Logic-Based Program Synth. Transform. (LOPSTR), 2015.

[50] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, 1991.

[51] X. Cui, T. R. Scogland, B. R. de Supinski, and W.-c. Feng. Directive-Based Pipelining Extension for OpenMP. In IEEE Int. Conf. Cluster Comput. (CLUSTER), 2016.

[52] X. Cui, T. R. Scogland, B. R. de Supinski, and W.-c. Feng. Directive-based Partitioning and Pipelining for Graphics Processing Units. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2017.

[53] S. Dalton, N. Bell, L. Olson, and M. Garland. CUSP: Generic Parallel Algorithms for Sparse Matrix and Graph Computations, 2014. http://cusplibrary.github.io/ v.0.5.0.

[54] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing (SHOC) Benchmark Suite. In ACM Workshop Gen. Purpose Process. Graphics Process. Unit (GPGPU), 2010.

[55] K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Auto-tuning the 27-point Stencil for Multicore. In Int. Workshop Autom. Perform. Tuning (iWAPT), 2009.

[56] T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw., 2011.

[57] J. Demouth. Shuffle: Tips and Tricks, 2013. GTC’13 Presentation.

[58] P. Di and J. Xue. Model-driven Tile Size Selection for DOACROSS Loops on GPUs. In Parallel Process.: Int. Eur. Conf. Parallel Distrib. Comput. (EuroPar), 2011.

[59] P. Di, D. Ye, Y. Su, Y. Sui, and J. Xue. Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Parallelism on GPUs. In Int. Conf. on Parallel Process. (ICPP), 2012.

[60] I. El Hajj, J. Gómez-Luna, C. Li, L.-W. Chang, D. Milojicic, and W.-m. Hwu. KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism. In IEEE/ACM Int. Symp. Microarchit. (MICRO), 2016.

[61] T. L. Falch and A. C. Elster. Register Caching for Stencil Computations on GPUs. In Int. Symp. Symbolic Numer. Algorithms Sci. Comp. (SYNASC), 2014.

[62] M. Farrar. Striped Smith-Waterman Speeds Database Searches Six Times over other SIMD Implementations. Oxford Bioinf., 2007.

[63] P. Flick and S. Aluru. Parallel Distributed Memory Construction of Suffix and Longest Common Prefix Arrays. In ACM Int. Conf. High Perf. Comput., Netw., Storage and Anal. (SC), 2015.

[64] F. Franchetti, F. Mesmay, D. McFarlin, and M. Püschel. Operator Language: A Program Generation Framework for Fast Kernels. In IFIP TC 2 Working Conf. Domain-Specific Lang. (DSL), Springer, 2009.

[65] T. Furtak, J. N. Amaral, and R. Niewiadomski. Using SIMD Registers and Instructions to Enable Instruction-level Parallelism in Sorting Algorithms. In ACM Symp. Parallel Alg. Arch. (SPAA), 2007.

[66] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In USENIX Symp. Oper. Syst. Des. Impl. (OSDI), 2012.

[67] M. W. Green. Some Improvements in Non-adaptive Sorting Algorithms. In Annu. Princeton Conf. Inf. Sci. Syst., 1972.

[68] O. Green, R. McColl, and D. A. Bader. GPU Merge Path: A GPU Merging Algorithm. In ACM Int. Conf. Supercomput. (ICS), 2012.

[69] M. Harris, J. D. Owens, S. Sengupta, S. Tzeng, Y. Zhang, A. Davidson, R. Patel, L. Wang, and S. Ashkiani. CUDPP: CUDA Data-Parallel Primitives Library. http://cudpp.github.io/.

[70] T. N. Hibbard. An Empirical Study of Minimal Storage Sorting. Commun. ACM, 1963.

[71] J. Hoberock and N. Bell. Thrust: A Parallel Algorithms Library, 2015. https://thrust.github.io/.

[72] K. Hou, W.-c. Feng, and S. Che. Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors. In IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), 2017.

[73] K. Hou, W. Liu, H. Wang, and W.-c. Feng. Fast Segmented Sort on GPUs. In ACM Int. Conf. Supercomput. (ICS), 2017.

[74] K. Hou, H. Wang, and W.-c. Feng. Delivering Parallel Programmability to the Masses via the Intel MIC Ecosystem: A Case Study. In Int. Conf. Parallel Process. Workshops (ICPPW), 2014.

[75] K. Hou, H. Wang, and W.-c. Feng. ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-based Many-core Processors. In ACM Int. Conf. Supercomput. (ICS), 2015.

[76] K. Hou, H. Wang, and W.-c. Feng. AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-Based Multi-and Many-Core Processors. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2016.

[77] K. Hou, H. Wang, and W.-c. Feng. GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs. In ACM Conf. Comput. Front. (CF), 2017.

[78] K. Hou, H. Wang, and W.-c. Feng. A Framework for the Automatic Vectorization of Parallel Sort on x86-based Processors. IEEE Trans. Parallel Distrib. Syst. (TPDS), 2018.

[79] K. Hou, H. Wang, W.-c. Feng, J. Vetter, and S. Lee. Highly Efficient Compensation-based Parallelism for Wavefront Loops on GPUs. In IEEE Int. Parallel and Distrib. Process. Symp. (IPDPS), 2018.

[80] K. Hou, Y. Zhao, J. Huang, and L. Zhang. Performance Evaluation of the Three-Dimensional Finite-Difference Time-Domain (FDTD) Method on Fermi Architecture GPUs. In Int. Conf. Algorithms Archit. Parallel Process. (ICA3PP), Springer, 2011.

[81] X. Huo, B. Ren, and G. Agrawal. A Programming System for Xeon Phis with Runtime SIMD Parallelization. In ACM Int. Conf. Supercomput. (ICS), 2014.

[82] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In ACM Int. Conf. Parallel Arch. Compil. Tech. (PACT), 2007.

[83] H. Inoue and K. Taura. SIMD- and Cache-friendly Algorithm for Sorting an Array of Structures. Proc. VLDB Endow. (VLDB), 2015.

[84] Intel Co. Intel Xeon Phi Coprocessor System Software Developers Guide, 2012. Doc. ID: 488596.

[85] P. Jiang and G. Agrawal. Combining SIMD and Many/Multi-core Parallelism for Finite State Machines with Enumerative Speculation. In ACM SIGPLAN Symp. Principles Pract. Parallel Program. (PPoPP), 2017.

[86] W. Jung, J. Park, and J. Lee. Versatile and Scalable Parallel Histogram Construction. In IEEE Int. Conf. Parallel Archit. Compil. Tech. (PACT), 2014.

[87] A. Khajeh-Saeed, S. Poole, and J. B. Perot. Acceleration of the Smith-Waterman Algorithm using Single and Multiple Graphics Processors. J. Comput. Phys., Elsevier, 2010.

[88] S. Lee, L. Chen, S. Duan, S. Chinthavali, M. Shankar, and B. A. Prakash. URBAN-NET: A Network-based Infrastructure Monitoring and Analysis System for Emergency Management and Public Safety. In IEEE Int. Conf. on Big Data (BigData), 2016.

[89] N. Leischner, V. Osipov, and P. Sanders. GPU Sample Sort. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2010.

[90] A. Li, W. Liu, M. R. Kristensen, B. Vinter, H. Wang, K. Hou, A. Marquez, and S. L. Song. Exploring and Analyzing the Real Impact of Modern On-package Memory on HPC Scientific Kernels. In ACM/IEEE Int. Conf. High Perf. Comput., Netw., Storage and Anal. (SC), 2017.

[91] A. Li, S. L. Song, W. Liu, X. Liu, A. Kumar, and H. Corporaal. Locality-Aware CTA Clustering for Modern GPUs. In ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2017.

[92] B. Li, H.-C. Chang, S. Song, C.-Y. Su, T. Meyer, J. Mooring, and K. Cameron. Extending PowerPack for Profiling and Analysis of High-Performance Accelerator-Based Systems. Parallel Process. Lett., 2014.

[93] B. Li, H. C. Chang, S. Song, C. Y. Su, T. Meyer, J. Mooring, and K. W. Cameron. The Power-Performance Tradeoffs of the Intel Xeon Phi on HPC Applications. In IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), 2014.

[94] B. Li and E. A. León. Memory Throttling on BG/Q: A Case Study with Explicit Hydrodynamics. In USENIX Workshop Power-Aware Comput. Syst. (HotPower), 2014.

[95] B. Li, E. A. León, and K. W. Cameron. COS: A Parallel Performance Model for Dynamic Variations in Processor Speed, Memory Speed, and Thread Concurrency. In ACM Int. Symp. High-Perform. Parallel Distrib. Comput. (HPDC), 2017.

[96] B. Li, S. Song, I. Bezakova, and K. W. Cameron. Energy-Aware Replica Selection for Data-Intensive Services in Cloud. In IEEE Int. Symp. Model. Anal. Simul. Comput. Telecommun. Syst., 2012.

[97] B. Li, S. L. Song, I. Bezakova, and K. W. Cameron. EDR: An Energy-aware Runtime Load Distribution System for Data-intensive Applications in the Cloud. In IEEE Int. Conf. Cluster Comput. (CLUSTER), 2013.

[98] F. Liu, H. Cai, G. Wang, D. D. Yao, K. O. Elish, and B. G. Ryder. MR-Droid: A Scalable and Prioritized Analysis of Inter-App Communication Risks. In Mobile Secur. Technol. (MoST), 2017.

[99] F. Liu, X. Shu, D. Yao, and A. R. Butt. Privacy-preserving scanning of big content for sensitive data exposure with MapReduce. In ACM Conf. Data Appl. Secur. Privacy (CODASPY), 2015.

[100] F. Liu, C. Wang, A. Pico, D. Yao, and G. Wang. Measuring the Insecurity of Mobile Deep Links. In USENIX Secur. Symp. (Security), 2017.

[101] W. Liu, A. Li, J. Hogg, I. S. Duff, and B. Vinter. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solvers. In Parallel Process.: Int. Eur. Conf. Parallel Distrib. Comput. (EuroPar). Springer Berlin Heidelberg, 2016.

[102] W. Liu and B. Vinter. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. J. Parallel Distrib. Comput. (JPDC), 2015.

[103] W. Liu and B. Vinter. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In ACM Int. Conf. Supercomput. (ICS), 2015.

[104] W. Liu and B. Vinter. Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors. Parallel Comput. (ParCo), 2015.

[105] Y. Liu and B. Schmidt. SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors. In IEEE Int. Conf. Appl.-spec. Syst. Archit. Process. (ASAP), 2014.

[106] Y. Luo, G. Tan, Z. Mo, and N. Sun. FAST: A Fast Stencil Autotuning Framework Based On An Optimal-solution Space Model. In ACM Int. Conf. Supercomput. (ICS), 2015.

[107] A. Magni, C. Dubach, and M. F. P. O’Boyle. A Large-scale Cross-architecture Evaluation of Thread-coarsening. In ACM/IEEE Int. Conf. High Perf. Comput., Netw., Storage and Anal. (SC), 2013.

[108] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua. An Evaluation of Vectorizing Compilers. In IEEE Int. Conf. Parallel Archit. Compil. Tech. (PACT), 2011.

[109] U. Manber and G. Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput., 1993.

[110] N. Manjikian and T. S. Abdelrahman. Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors. IEEE Trans. Parallel Distrib. Syst. (TPDS), 2001.

[111] N. Maruyama and T. Aoki. Optimizing Stencil Computations for NVIDIA Kepler GPUs. In Int. Workshop High-Perform. Stencil Comput. (HiStencils), 2014.

[112] N. Maruyama, T. Nomura, K. Sato, and S. Matsuoka. Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-scale GPU-accelerated Supercomputers. In ACM Int. Conf. High Perf. Comput., Netw., Storage and Anal. (SC), 2011.

[113] D. S. McFarlin, V. Arbatov, F. Franchetti, and M. Püschel. Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets. In ACM Int. Conf. Supercomput. (ICS), 2011.

[114] D. Merrill and A. Grimshaw. High Performance and Scalable Radix Sorting: A Case Study of Implementing Dynamic Parallelism for GPU Computing. Parallel Process. Lett., 2011.

[115] NCBI. Genbank. ftp://ftp.ncbi.nlm.nih.gov/genbank.

[116] S. B. Needleman and C. D. Wunsch. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol. Bio., Elsevier, 1970.

[117] A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs. In ACM/IEEE Int. Conf. High Perf. Comput., Netw., Storage and Anal. (SC), 2010.

[118] H. Nguyen. GPU Gems 3. Addison-Wesley Professional, first edition, 2007.

[119] M. Nourian, X. Wang, X. Yu, W.-c. Feng, and M. Becchi. Demystifying Automata Processing: GPUs, FPGAs or Micron’s AP? In ACM Int. Conf. Supercomput. (ICS), 2017.

[120] NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture Kepler GK110, 2014. v1.0.

[121] NVIDIA. cuSPARSE lib, 2016. https://developer.nvidia.com/cuSPARSE.

[122] NVIDIA Research. CUB 1.6.4, 2016. http://nvlabs.github.io/cub/.

[123] M. Pharr and W. Mark. ISPC: A SPMD Compiler for High-Performance CPU Programming. In Innovative Parallel Comput. (InPar), 2012.

[124] P. Plauger, M. Lee, D. Musser, and A. A. Stepanov. C++ Standard Template Lib. Prentice Hall PTR, 2000.

[125] F. Porikli. Integral Histogram: a Fast Way to Extract Histograms in Cartesian Spaces. In IEEE Comput. Soc. Conf. on Comput. Vision Pattern Recognit. (CVPR), 2005.

[126] L.-N. Pouchet. Polybench: The Polyhedral Benchmark Suite, 2015. http://web.cse.ohio-state.edu/pouchet/software/polybench.

[127] R. Rahman. Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers. Apress, 2013.

[128] P. S. Rawat, C. Hong, M. Ravishankar, V. Grover, L.-N. Pouchet, A. Rountev, and P. Sadayappan. Resource Conscious Reuse-Driven Tiling for GPUs. In ACM Int. Conf. Parallel Archit. Compil. (PACT), 2016.

[129] P. S. Rawat, C. Hong, M. Ravishankar, V. Grover, L.-N. Pouchet, and P. Sadayappan. Effective Resource Management for Enhancing Performance of 2D and 3D Stencils on GPUs. In ACM Workshop Gen. Purpose Process. Graphics Process. Unit (GPGPU), 2016.

[130] J. Reinders. Intel Threading Building Blocks. O’Reilly & Associates, Inc., 2007.

[131] G. Ren, P. Wu, and D. Padua. Optimizing Data Permutations for SIMD Devices. In ACM SIGPLAN Conf. Program. Lang. Design Impl. (PLDI), 2006.

[132] P. Rice, I. Longden, A. Bleasby, et al. EMBOSS: the European Molecular Biology Open Software Suite.

[133] N. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2009.

[134] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In ACM SIGMOD Int. Conf. Manage. Data (SIGMOD), 2010.

[135] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey. Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications? In IEEE Int. Symp. Comput. Archit. (ISCA), 2012.

[136] B. Schling. The Boost C++ Libraries. XML Press, 2011.

[137] X. Shu, F. Liu, and D. (Daphne) Yao. Rapid Screening of Big Data Against Inadvertent Leaks. In Big Data Concepts, Theories, and Applications. Springer, 2016.

[138] J. Shun, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, A. Kyrola, H. V. Simhadri, and K. Tangwongsan. Brief Announcement: The Problem Based Benchmark Suite. In ACM Symp. Parallel Alg. Arch. (SPAA), 2012.

[139] E. Sintorn and U. Assarsson. Fast Parallel GPU-sorting Using a Hybrid Algorithm. J. Parallel Distrib. Comput. (JPDC), 2008.

[140] T. F. Smith and M. S. Waterman. Identification of Common Molecular Subsequences. J. Mol. Bio., Elsevier, 1981.

[141] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Center for Reliable and High-Performance Computing, 2012.

[142] A. Szalkowski, C. Ledergerber, P. Krähenbühl, and C. Dessimoz. SWPS3 fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2. BMC Res Notes, 2008.

[143] K. Tian, Z. Li, K. Bowers, and D. D. Yao. FrameHanger: Evaluating and Classifying Iframe Injection at Large Scale. In EAI Int. Conf. Secur. Privacy Commun. Networks (SecureComm), 2018.

[144] K. Tian, G. Tan, D. D. Yao, and B. G. Ryder. ReDroid: Prioritizing Data Flows and Sinks for App Security Transformation. In ACM Workshop Forming an Ecosyst. Around Software Transform., 2017.

[145] K. Tian, D. Yao, B. G. Ryder, and G. Tan. Analysis of Code Heterogeneity for High-precision Classification of Repackaged Malware. In IEEE Secur. Privacy Workshops (SPW), 2016.

[146] K. Tian, D. D. Yao, B. G. Ryder, G. Tan, and G. Peng. Detection of Repackaged Android Malware with Code-heterogeneity Features. IEEE Trans. Dependable Secure Comput., 2017.

[147] D. Unat, X. Cai, and S. B. Baden. Mint: Realizing CUDA Performance in 3D Stencil Methods with Annotated C. In ACM Int. Conf. Supercomput. (ICS), 2011.

[148] V. Vassilevska, R. Williams, and R. Yuster. Finding Heaviest H-subgraphs in Real Weighted Graphs, with Applications. ACM Trans. Algorithms, 2010.

[149] A. Vizitiu, L. Itu, C. Ni, and C. Suciu. Optimized Three-dimensional Stencil Computation on Fermi and Kepler GPUs. In High Perform. Extreme Comput. Conf. (HPEC), 2014.

[150] H. Wang, W. Liu, K. Hou, and W.-c. Feng. Parallel Transposition of Sparse Data Structures. In ACM Int. Conf. Supercomput. (ICS), 2016.

[151] J. Wang, P. Shang, and J. Yin. DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality. In Cloud Comput. Data-Intensive Appl. Springer, 2014.

[152] J. Wang, X. Xie, and J. Cong. Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2017.

[153] J. Wang and S. Yalamanchili. Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications. In IEEE Int. Symp. Workload Charact. (IISWC), 2014.

[154] L. Wang, S. Baxter, and J. D. Owens. Fast Parallel Suffix Array on the GPU. In Parallel Process.: Int. Eur. Conf. Parallel Distrib. Comput. (EuroPar). Springer Berlin Heidelberg, 2015.

[155] S. Williams, A. Waterman, and D. Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Comm. ACM, 2009.

[156] M. Wolfe. Loops Skewing: the Wavefront Method Revisited. Int. J. Parallel Program., 1986.

[157] H. Wu, G. Diamos, T. Sheard, M. Aref, S. Baxter, M. Garland, and S. Yalamanchili. Red Fox: An Execution Environment for Relational Query Processing on GPUs. In IEEE/ACM Int. Symp. Code Gener. Optim. (CGO), 2014.

[158] H. Wu, D. Li, and M. Becchi. Compiler-Assisted Workload Consolidation for Efficient Dynamic Parallelism on GPU. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2016.

[159] X. Wu, C. Lively, V. Taylor, H. C. Chang, C. Y. Su, K. Cameron, S. Moore, D. Terpstra, and V. Weaver. MuMMI: Multiple Metrics Modeling Infrastructure. In ACIS Int. Conf. Software Eng., Artif. Intell. Networking Parallel/Distrib. Comput., 2013.

[160] S. Xiao, A. M. Aji, and W.-c. Feng. On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit. In IEEE Int. Conf. Parallel Distrib. Syst. (ICPADS), 2009.

[161] T. Xiaochen, K. Rocki, and R. Suda. Register Level Sort Algorithm on Multi-core SIMD Processors. In ACM Workshop Irregular App.: Arch. Alg. (IA3), 2013.

[162] K. Xu, K. Tian, D. Yao, and B. G. Ryder. A Sharper Sense of Self: Probabilistic Reasoning of Program Behaviors for Anomaly Detection with Context Sensitivity. In IEEE/IFIP Int. Conf. Dependable Systems Networks (DSN), 2016.

[163] S. Yan, G. Long, and Y. Zhang. StreamScan: Fast Scan Algorithms for GPUs Without Global Barrier Synchronization. In ACM SIGPLAN Symp. Principles Pract. Parallel Program. (PPoPP), 2013.

[164] S. Yang, H. Chung, X. Lin, S. Lee, L. Chen, A. Wood, A. L. Kavanaugh, S. D. Sheetz, D. J. Shoemaker, and E. A. Fox. PhaseVis1: What, When, Where, and Who in Visualizing the Four Phases of Emergency Management through the Lens of Social Media. In Int. ISCRAM Conf., 2013.

[165] Y. Yang and H. Zhou. CUDA-NP: Realizing Nested Thread-level Parallelism in GPGPU Applications. In ACM SIGPLAN Symp. Principles Pract. Parallel Program. (PPoPP), 2014.

[166] X. Yu. Deep Packet Inspection on Large Datasets: Algorithmic and Parallelization Techniques for Accelerating Regular Expression Matching on Many-core Processors. University of Missouri-Columbia, 2013.

[167] X. Yu and M. Becchi. Exploring Different Automata Representations for Efficient Regular Expression Matching on GPUs. ACM SIGPLAN Not., 2013.

[168] X. Yu and M. Becchi. GPU Acceleration of Regular Expression Matching for Large Datasets: Exploring the Implementation Space. In ACM Conf. Comput. Front. (CF), 2013.

[169] X. Yu, W.-c. Feng, D. D. Yao, and M. Becchi. O3FA: A Scalable Finite Automata-based Pattern-Matching Engine for Out-of-Order Deep Packet Inspection. In ACM Symp. Archit. Networking Commun. Syst. (ANCS), 2016.

[170] X. Yu, K. Hou, H. Wang, and W.-c. Feng. A Framework for Fast and Fair Evaluation of Automata Processing Hardware. In IEEE Int. Symp. Workload Charact. (IISWC), 2017.

[171] X. Yu, K. Hou, H. Wang, and W.-c. Feng. Robotomata: A Framework for Approximate Pattern Matching of Big Data on an Automata Processor. In IEEE Int. Conf. on Big Data (BigData), 2017.

[172] X. Yu, B. Lin, and M. Becchi. Revisiting State Blow-Up: Automatically Building Augmented-FA While Preserving Functional Equivalence. IEEE J. Selected Areas Commun., 2014.

[173] X. Yu, H. Wang, W.-c. Feng, H. Gong, and G. Cao. cuART: Fine-Grained Algebraic Reconstruction Technique for Computed Tomography Images on GPUs. In IEEE/ACM Int. Symp. Cluster Cloud Grid Comput. (CCGrid), 2016.

[174] X. Yu, H. Wang, W.-c. Feng, H. Gong, and G. Cao. An Enhanced Image Reconstruction Tool for Computed Tomography on GPUs. In ACM Conf. Comput. Front. (CF), 2017.

[175] Y. Yuan, R. Lee, and X. Zhang. The Yin and Yang of Processing Data Warehousing Queries on GPU Devices. Proc. VLDB Endow. (PVLDB), 2013.

[176] D. Zhang, H. Wang, K. Hou, J. Zhang, and W.-c. Feng. pDindel: Accelerating InDel Detection on a Multicore CPU Architecture with SIMD. In IEEE Int. Conf. Comput. Adv. Bio Med. Sci. (ICCABS), 2015.

[177] F. Zhang. Automatic Loop Tuning and Memory Management for Stencil Computations, 2014. http://scholarcommons.sc.edu/etd/3012 (Doctoral Dissertation).

[178] H. Zhang, D. Yao, and N. Ramakrishnan. Causality-based Sensemaking of Network Traffic for Android Application Security. In ACM Workshop on Artif. Intell. Secur. (AISec), 2016.

[179] H. Zhang, D. Yao, N. Ramakrishnan, and Z. Zhang. Causality Reasoning about Network Events for Detecting Stealthy Malware Activities. Computers & Security, 2016.

[180] H. Zhang, D. D. Yao, and N. Ramakrishnan. Detection of Stealthy Malware Activities with Traffic Causality and Scalable Triggering Relation Discovery. In ACM Symp. on Inf. Comput. Commun. Secur., 2014.

[181] J. Zhang, H. Lin, P. Balaji, and W.-c. Feng. Consolidating Applications for Energy Efficiency in Heterogeneous Computing Systems. In IEEE Int. Conf. High Perform. Comput. and Commun. (HPCC), 2013.

[182] J. Zhang, H. Lin, P. Balaji, and W.-c. Feng. Optimizing Burrows-Wheeler Transform-Based Sequence Alignment on Multicore Architectures. In IEEE/ACM Int. Symp. Cluster Cloud Grid Comput. (CCGrid), 2013.

[183] J. Zhang, S. Misra, H. Wang, and W.-c. Feng. muBLASTP: Database-Indexed Protein Sequence Search on Multicore CPUs. BMC Bioinf., 2016.

[184] J. Zhang, S. Misra, H. Wang, and W.-c. Feng. Eliminating Irregularities of Protein Sequence Search on Multicore Architectures. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2017.

[185] J. Zhang, H. Wang, and W.-c. Feng. cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU. IEEE/ACM Trans. Comput. Bio. Bioinf. (TCBB), 2016.

[186] J. Zhang, H. Wang, H. Lin, and W.-c. Feng. cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on a GPU. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2014.

[187] K. Zhang, K. Wang, Y. Yuan, L. Guo, R. Lee, and X. Zhang. Mega-KV: A Case for GPUs to Maximize the Throughput of In-memory Key-value Stores. Proc. VLDB Endow. (PVLDB), 2015.

[188] Y. Zhang and F. Mueller. Autogeneration and Autotuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters. IEEE Trans. Parallel Distrib. Syst. (TPDS), 2013.

[189] Y. Zhang and B. A. Prakash. DAVA: Distributing Vaccines over Networks under Prior Information. In SIAM Int. Conf. Data Mining, 2014.

[190] Y. Zhang and B. A. Prakash. Scalable Vaccine Distribution in Large Graphs given Uncertain Data. In ACM Int. Conf. Inf. Knowledge Manage., 2014.

[191] N. Zhao, A. Anwar, Y. Cheng, M. Salman, D. Li, J. Wan, C. Xie, X. He, F. Wang, and A. R. Butt. Chameleon: An Adaptive Wear Balancer for Flash Clusters. In IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2018.

[192] M. Zuluaga, P. Milder, and M. Püschel. Computer Generation of Streaming Sorting Networks. In ACM Des. Autom. Conf. (DAC), 2012.