Tousimojarad, Ashkan (2016) GPRM: a high performance programming framework for manycore processors. PhD thesis. http://theses.gla.ac.uk/7312/

Copyright and moral rights for this thesis are retained by the author. A copy can be downloaded for personal non-commercial research or study. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author. When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given.

Glasgow Theses Service
http://theses.gla.ac.uk/
[email protected]

GPRM: A HIGH PERFORMANCE PROGRAMMING FRAMEWORK FOR MANYCORE PROCESSORS

ASHKAN TOUSIMOJARAD

SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

SCHOOL OF COMPUTING SCIENCE
COLLEGE OF SCIENCE AND ENGINEERING
UNIVERSITY OF GLASGOW

NOVEMBER 2015

© ASHKAN TOUSIMOJARAD

Abstract

Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed, multithreaded environment is a significant challenge.

In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks and their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks rather than threads.

We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, simply by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called "Global Sharing", which improves performance in multiprogramming situations. We use OpenMP, the most popular model for shared-memory parallel programming, as the main competitor to GPRM for solving three well-known problems on both platforms: LU factorisation of sparse matrices, image convolution, and linked list processing. We focus on proposing solutions that best fit GPRM's model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for the LU factorisation results in a notable performance improvement for sparse matrices with large numbers of small blocks.
We investigate the overhead of GPRM's task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List processing and performs better than OpenMP implementations on the Xeon Phi.

The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.

Acknowledgements

First and foremost, I would like to thank my PhD supervisor, Dr. Wim Vanderbauwhede, for his support, encouragement and guidance over the past four years. I would also like to extend my gratitude to my second supervisor, Dr. William Paul Cockshott, for his assistance and advice throughout my PhD study. It has been greatly appreciated and I cannot thank you both enough. In addition, I would like to acknowledge the Scottish Informatics and Computer Science Alliance (SICSA) for funding this research.

I would like to express my heartfelt thanks to my great family: my mother, Firoozeh, my father, Khosro, and my sister, Shirin, for all their love and encouragement.

I would also like to take this opportunity to thank Dr. Jeremy Singer and Professor Sven-Bodo Scholz, my viva examiners, for their helpful comments and suggestions. Special thanks to all the staff at the School of Computing Science for their help and support during these years. Last but not least, I would like to thank Lily for her kindness, support, and encouragement.

To my family.

Table of Contents

1 Introduction 1
  1.1 The Shift Towards Chip Multiprocessors (CMPs) 2
  1.2 Parallel Computing 3
    1.2.1 From Threads to Tasks 4
  1.3 Thesis Statement 5
  1.4 Thesis Organisation 5

2 Background 9
  2.1 Design of Parallel Programs 9
    2.1.1 Partitioning 9
    2.1.2 Communication 10
    2.1.3 Agglomeration 11
    2.1.4 Mapping 12
  2.2 Parallel Programming Paradigms 12
    2.2.1 Interaction 13
    2.2.2 Decomposition 14
  2.3 Parallel Programming Models and Standards 14
    2.3.1 OpenMP 15
    2.3.2 Cilk Plus 16
    2.3.3 Intel Threading Building Blocks (TBB) 16
    2.3.4 Task Parallel Library (TPL) 16
    2.3.5 Message Passing Interface (MPI) 17
    2.3.6 Unified Parallel C (UPC) 17
    2.3.7 Cascade High Productivity Language (Chapel) 18
    2.3.8 StarSs, SMPSs, and OmpSs 18
    2.3.9 OpenStream 19
    2.3.10 Swan 20
    2.3.11 Glasgow Parallel Haskell 20
    2.3.12 Clojure 20
    2.3.13 Single Assignment C (SAC) 21
    2.3.14 OpenCL 21
    2.3.15 MapReduce 22
  2.4 Task Scheduling and Load Balancing 22
    2.4.1 Work Stealing 23
    2.4.2 Scheduling on Modern Processors 23
  2.5 Related Work 24
    2.5.1 Architectural Aspects 24
    2.5.2 Programming and Performance Aspects 26
  2.6 Summary 29

3 Hardware Platforms 30
  3.1 Flynn's Taxonomy 30
  3.2 Memory Organisation 32
    3.2.1 Distributed Memory 32
    3.2.2 Shared Memory 33
    3.2.3 Memory Access Time Reduction 33
  3.3 Design Variants of Multicore Machines 34
  3.4 Tilera TILEPro64 35
    3.4.1 TILEPro64 Architecture 36
    3.4.2 TILEPro64 Performance Considerations 37
  3.5 Intel Xeon Phi 38
    3.5.1 Xeon Phi Architecture 39
    3.5.2 Xeon Phi Performance Considerations 39
  3.6 Summary 40

4 Task-based Parallel Models for Shared Memory Programming 41
  4.1 Three Popular Task-based Parallel Models 41
    4.1.1 OpenMP 42
    4.1.2 Cilk Plus 42
    4.1.3 Threading Building Blocks (TBB) 43
  4.2 Comparison on the Xeon Phi: Uniprogramming Workloads 44
    4.2.1 Experimental Setup 44
    4.2.2 Parallel Fibonacci Benchmark: Fibonacci 45
    4.2.3 Parallel Merge Sort Benchmark: MergeSort 47
    4.2.4 Parallel Matrix Multiplication Benchmark: MatMul 47
    4.2.5 Overhead of the Runtime Libraries 51
  4.3 Comparison on the Xeon Phi: Multiprogramming Workloads 51
  4.4 Multiprogramming Benchmark 52
  4.5 Information Sharing and Multiprogramming 54
    4.5.1 Thread Mapping Strategies 54
    4.5.2 Selected OpenMP Benchmarks for the TILEPro64 57
    4.5.3 Results of Multiprogramming using Information Sharing 59
  4.6 Summary 61

5 GPRM: The Glasgow Parallel Reduction Machine 62
  5.1 GPRM Architecture 64
  5.2 GPC: The GPRM Front-End Language 65
    5.2.1 GPC Language Features 65
    5.2.2 GPC Compiler 66
  5.3 GPRM Runtime System 66
    5.3.1 Implementation of the Runtime System 67
  5.4 GPRM Model of Parallel Execution 68
    5.4.1 Communication Messages 72
  5.5 GPRM Task Stealing 73
    5.5.1 Comparison of the Stealing Strategies 73
    5.5.2 Implementation of Task Stealing in GPRM 77
  5.6 GPRM Global Sharing 79
  5.7 GPRM Productivity 80
  5.8 Summary 81

6 Comparison of GPRM with Popular Parallel Programming Models 82
  6.1 Uniprogramming Workloads 82
    6.1.1 Experimental Setup 83
    6.1.2 Fibonacci Benchmark 83
    6.1.3 MergeSort Benchmark 88
    6.1.4 MatMul Benchmark 93
    6.1.5 Detailed Comparison for the MatMul Benchmark 97
    6.1.6 Choosing a Proper Cutoff 99
  6.2 Multiprogramming Workloads 100
    6.2.1 Case 1: Multiple Instances of a Single Program 100
    6.2.2 Case 2: Single Instances of Multiple Programs 103
  6.3 Summary 103

7 Parallel Lower-Upper Factorisation of Sparse Matrices 105
  7.1 GPRM Parallel Loops 105
  7.2 Matrix Multiplication Micro-benchmark 108
  7.3 Sparse LU Factorisation 111
  7.4 Results on the TILEPro64 115
    7.4.1 OpenMP Performance Bottlenecks 116
    7.4.2 Comparison of the OpenMP and GPRM Solutions 116