
Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Xin Huo, B.S., M.S.
Graduate Program in Department of Computer Science and Engineering
The Ohio State University
2014

Dissertation Committee:
Dr. Gagan Agrawal, Advisor
Dr. P. Sadayappan
Dr. Feng Qin

© Copyright by Xin Huo 2014

Abstract

Parallel computing architectures have become ubiquitous. In particular, the Graphics Processing Unit (GPU), the Intel Xeon Phi co-processor (a many-integrated-core architecture), and heterogeneous coupled and decoupled CPU-GPU architectures have emerged as significant players in high performance computing due to their performance, cost, and energy efficiency. Given this trend, an important research issue is how to efficiently utilize this variety of architectures to accelerate different kinds of applications.

Our overall goal is to provide parallel strategies and runtime support for modern GPUs, the Intel Xeon Phi, and heterogeneous architectures to accelerate applications with different communication patterns. The challenges arise from two aspects: the applications and the architectures. On the application side, not all kinds of applications can be easily ported to GPUs or the Intel Xeon Phi with high performance. On the architecture side, the SIMT (Single Instruction Multiple Thread) and SIMD (Single Instruction Multiple Data) execution models employed by GPUs and the Intel Xeon Phi, respectively, do not favor applications with data and control dependencies. Moreover, heterogeneous CPU-GPU architectures, such as the integrated CPU-GPU, bring new models and challenges for data sharing.
Our efforts focus on four kinds of application patterns: generalized reduction, irregular reduction, stencil computation, and recursive applications. We explore mechanisms for parallelizing them on various parallel architectures, including GPUs, the Intel Xeon Phi, and heterogeneous CPU-GPU architectures.

We first study parallel strategies for generalized reductions on modern GPUs. The traditional mechanism for parallelizing generalized reductions on GPUs is the full replication method, which assigns each thread an independent copy of the reduction object to avoid data races and maximize parallelism. However, it introduces significant memory and result combination overhead, and is inapplicable for a large number of threads. To reduce memory overhead, we propose a locking method, which allows all threads in a block to share the same copy, and supports both fine-grained and coarse-grained locking. The locking method succeeds in reducing memory overhead, but introduces thread competition. To achieve a tradeoff between memory overhead and thread competition, we also present a hybrid scheme.

Next, we investigate strategies for irregular or unstructured applications. One of the key challenges is how to utilize the limited-size shared memory on GPUs. We present a partitioning-based locking scheme that chooses an appropriate partitioning space to guarantee that all the data of a partition fits into shared memory. Moreover, we propose an efficient partitioning method and a data reordering mechanism. Further, to better utilize the computing capability of both the host and the accelerator, we extend our irregular parallelization scheme to the heterogeneous CPU-GPU environment by introducing a multi-level partitioning scheme and a dynamic task scheduling framework, which supports pipelining between partitioning and computation, as well as work stealing between the CPU and the GPU.
Then, motivated by the emergence of the integrated CPU-GPU architecture and its shared physical memory, we develop a thread-block-level scheduling framework for three communication patterns: generalized reduction, irregular reduction, and stencil computation. This novel scheduling framework introduces lock-free access between the CPU and the GPU, reduces command launching and synchronization overhead by removing extra synchronizations, and improves load balance.

Although support for unstructured control flow has been added to GPUs, the performance of recursive applications on modern GPUs is limited by the current thread reconvergence method in the SIMT execution model. To efficiently port recursive applications to GPUs, we present a novel dynamic thread reconvergence method, which allows threads to reconverge either before or after the static reconvergence point, and improves instructions per cycle (IPC) significantly.

The Intel Xeon Phi is an emerging x86-based many-core coprocessor architecture with wide SIMD vectors. To fully utilize its potential, however, applications must be vectorized to leverage the wide SIMD lanes, in addition to exploiting effective large-scale shared memory parallelism. Compared to the SIMT execution model on GPUs with CUDA or OpenCL, SIMD parallelism with an SSE-like instruction set imposes many restrictions, and has generally not benefited applications involving branches, irregular accesses, or even reductions. We consider the problem of accelerating applications involving different communication patterns on Xeon Phis, with an emphasis on effectively using the available SIMD parallelism. We offer an API for both shared memory and SIMD parallelization, and demonstrate its implementation. We use overloaded functions as a mechanism for providing SIMD code, assisted by runtime data reorganization and by our methods for effectively managing control flow.
In addition, all proposed methods have been extensively evaluated with multiple applications and different data inputs.

This is dedicated to the ones I love: my parents and my fiancee.

Acknowledgments

During the last five years of my Ph.D. studies, there are so many people to whom I would like to give my sincerest thanks! Without their help and support, it would not have been possible to write this doctoral dissertation. Only some of them can be given particular mention here.

First and foremost, I would like to thank my advisor, Dr. Gagan Agrawal, for his uncompromising dedication to the development and education of his students. This dissertation would not have been possible without his guidance, support, and patience. During the last five years, he not only guided my research on computer systems and software, but also helped me develop the broad and strong foundation upon which a life pursuing computer science and technology will be built. His commitment not only taught me all the qualities required of a researcher but also inspired me as a person. The joy, passion, and enthusiasm he has for his research were contagious and motivational for me and my future career.

I would also like to thank the members of my thesis committee, Drs. P. Sadayappan, Feng Qin, and Radu Teodorescu. It is through their active interest and incisive, critical analysis that I have been able to complete my research.

I would like to acknowledge the financial, academic, technical, and other support from the National Science Foundation, The Ohio State University, and the Computer Science and Engineering Department, especially the head of our department, Dr. Xiaodong Zhang, and the administrative staff of our department, especially Kathryn M. Reeves and Lynn Lyons. I would also like to extend my sincerest thanks to my internship mentor, Dr. Sriram Krishnamoorthy of Pacific Northwest National Laboratory.
I am truly grateful to the past and present members of my lab, the Data-Intensive and High Performance Computing Research Group, who gave me invaluable help and support in research, career, and life: Wenjing Ma, David Chiu, Qian Zhu, Fan Wang, Vignesh Ravi, Wei Jiang, Tantan Liu, Tekin Bicer, Jiedan Zhu, Bin Ren, Yu Su, Linchuan Chen, Yi Wang, Mehmet Can Kurt, Mucahid Kutlu, Jiaqi Liu, and Sameh Shohdy.

Special thanks go to my fiancee, Jingyan Wu, who has always been by my side. I am also thankful to my parents, all the rest of my family in China, and all of the friends I have encountered throughout my life. It is with their love, support, and encouragement that this dissertation exists.

Vita

October 13th, 1984 ... Born, Tianjin, China
2007 ... B.S. Software Engineering, Beijing Institute of Technology, Beijing, China
2009 ... M.S. Software Engineering, Beijing Institute of Technology, Beijing, China
2013 ... M.S. Computer Science and Engineering, The Ohio State University, Columbus, USA

Publications

Research Publications

Xin Huo, Bin Ren, and Gagan Agrawal. "A Programming System for Xeon Phis with Runtime SIMD Parallelization". In Proceedings of the 28th ACM International Conference on Supercomputing (ICS), June 2014.

Linchuan Chen, Xin Huo, and Gagan Agrawal. "Scheduling Methods for Accelerating Applications on Architectures with Heterogeneous Cores". In Proceedings of the 23rd International Heterogeneity in Computing Workshop (HCW), May 2014.

Xin Huo, Sriram Krishnamoorthy, and Gagan Agrawal. "Efficient Scheduling of Recursive Control Flow on GPUs". In Proceedings of the 27th ACM International Conference on Supercomputing (ICS), June 2013.

Linchuan Chen, Xin Huo, and Gagan Agrawal. "Accelerating MapReduce on a Coupled CPU-GPU Architecture". In Proceedings of the 25th IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC),