Gdura, Youssef Omran (2012) A new parallelisation technique for heterogeneous CPUs. PhD thesis, University of Glasgow. http://theses.gla.ac.uk/3406/


A New Parallelisation Technique for Heterogeneous CPUs

by Youssef Omran Gdura

A Thesis presented for the degree of

Doctor of Philosophy

to the School of Computing Science,

College of Science and Engineering, University of Glasgow

May 2012

Copyright © Youssef Omran Gdura, 2012.

Abstract

Parallelisation has moved into the mainstream in recent years, and the demand for parallelising tools that can do a better job of automatic parallelisation is higher than ever. During the last decade considerable attention has been focused on developing programming tools that support both explicit and implicit parallelism to keep up with the power of the new multiple core technology. Yet success in developing automatic parallelising compilers has been limited, mainly due to the complexity of the analysis required to exploit available parallelism and to manage other parallelisation measures such as data partitioning, alignment and synchronisation.

This dissertation investigates the development of a programming tool that automatically parallelises operations on large data structures on a heterogeneous architecture, and whether a high-level programming language can use this tool to exploit implicit parallelism and make use of the performance potential of modern multicore technology. The work involved the development of a fully automatic parallelisation tool, called VSM, that completely hides the underlying details of general purpose heterogeneous architectures. The VSM implementation provides direct and simple access for users to parallelise array operations on the Cell's accelerators without the need for any annotations or directives. This work also involved the extension of the Glasgow Vector Pascal compiler to work with the VSM implementation as one compiler system. The resulting compiler system, called VP-Cell, takes a single source code and parallelises array expressions automatically.

Several experiments were conducted using Vector Pascal benchmarks to show the validity of the VSM approach. The VP-Cell system achieved significant runtime performance on a single accelerator compared with the master processor, and near-linear speedups as code was run on additional Cell accelerators. Though VSM was designed mainly for developing parallelising compilers, it also showed considerable performance when used directly to run code on the Cell's accelerators.

Declaration

This thesis presents work that was carried out at the University of Glasgow under the supervision of Dr. William Cockshott and Dr. John O'Donnell, School of Computing Science, between March 2007 and January 2012. I declare that this work is entirely my own and that it has not been previously submitted for any other degree or qualification at any university.

Youssef Omran Gdura

Glasgow, May 2012

Acknowledgments

I would like to express my grateful acknowledgment to my supervisor Dr. William Paul Cockshott for his guidance, enthusiasm and assistance during the course of this PhD, and to my second supervisor Dr. John O'Donnell for his support. I also wish to express my sincere thanks to the School of Computing Science, University of Glasgow, for creating an environment that has been fabulous for research and fun. This work would not have been possible without the financial support of the Ministry of Education in Libya, and I am grateful to them for this opportunity.

My thanks also go to my external and internal examiners, Prof. Sven-Bodo Scholz and Dr. Wim Vanderbauwhede, for their interest in this work and for taking the time to study this thesis extensively.

I would also like to take this opportunity to express my sincere appreciation to all the staff at the University of Glasgow for their support and valuable assistance during the Libyan crisis in 2011.

I dedicate this work to the souls of my father and mother, who passed away while I was doing my Master degree, and to the souls of my oldest brother and oldest sister, who passed away while I was doing my PhD.

Special thanks to my beloved wife, Nadia, for her patience and for providing me comfort, and to the five parallel accelerators in my life: my daughters, Raian, Raihan and Jeanan, and my sons, Omran and Abdalrahman, for giving me their love, support and all the joy. I am most appreciative of my brothers and sisters for their continuing support, as without their support and wishes I could not have achieved what I have achieved today. Above all, I thank Allah almighty for his unlimited blessings.

Youssef

Glasgow, May 2012

Contents

Abstract i

List of Figures v

List of Tables viii

List of Publications ix

1 Introduction 1 1.1 Thesis Statement ...... 3 1.2 Motivations ...... 4 1.3 Objectives ...... 5 1.4 Contributions ...... 6 1.5 Outline ...... 7

2 Background 10 2.1 Overview of Parallel Processors ...... 10 2.2 Parallel Memory Architectures ...... 12 2.3 Parallel Computing ...... 13 2.4 Parallel Programming Paradigms ...... 22 2.5 Virtual Machines ...... 33 2.6 Genetic Algorithms ...... 34

3 The Cell Processor & Vector Pascal 37 3.1 Introduction to PowerPC Architectures ...... 37 3.2 The Cell Broadband Engine Processor ...... 47 3.3 Glasgow Vector Pascal ...... 60 3.4 Directives ...... 67

4 Related work 68 4.1 Compute Unified Device Architecture (CUDA) ...... 68 4.2 Open Computing Language (Open CL) ...... 71 4.3 Programming the Cell BE Architecture ...... 75


5 Virtual SIMD Machine 96 5.1 Introduction ...... 96 5.2 Virtual SIMD Registers ...... 98 5.3 Virtual SIMD Instructions ...... 100 5.4 VSM Messaging Protocol ...... 102 5.5 PPE Interpreter ...... 104 5.6 SPE Interpreter ...... 109 5.7 Using VSM ...... 123 5.8 Experimental Results ...... 125

6 Host Compiler Development 135 6.1 PowerPC Machine Description ...... 136 6.2 Machine-dependent Routines ...... 140 6.3 Assembly Macros ...... 142 6.4 Stack Frame Operations ...... 144 6.5 Compiler Building Process ...... 145 6.6 PowerPC Compiler Extension ...... 146 6.7 Building the VP-Cell Compiler System ...... 152 6.8 Coding ...... 154

7 Code Generator Optimiser 155 7.1 Introduction ...... 155 7.2 Previous Work ...... 156 7.3 Permutation Technique ...... 157 7.4 Why Instructions Ordering is a Problem ...... 158 7.5 Genetic Algorithm Approach to the Problem ...... 161 7.6 Key Design Aspects ...... 162 7.7 Implementation ...... 167 7.8 Experimental Results ...... 171

8 Evaluating VP-Cell Compiler 179 8.1 Machine Configuration ...... 179 8.2 Testing Developed Tools ...... 180 8.3 Performance Tuning ...... 181 8.4 Experimental Results ...... 182

9 Conclusion and Future Work 206 9.1 Contribution ...... 207 9.2 Future Work ...... 208

List of Figures

2.1 Arrays Alignment ...... 17 2.2 Sample of APL’s Character Set ...... 22 2.3 ZPL Directions Operators ...... 25 2.4 Defining Arrays in SaC ...... 28 2.5 SaC WITH loop ...... 29

3.1 Pascal Code and CISC Assembly Instruction ...... 38 3.2 Pascal Code and RISC Assembly Instruction ...... 38 3.3 PowerPC Instruction Format ...... 39 3.4 PowerPC ABI registers conventions ...... 40 3.5 The Cell BE Schematic Diagram ...... 49 3.6 SPE Hardware Diagram ...... 50 3.7 SPE_CONTROL_AREA Structure ...... 56 3.8 SPE Hardware Diagram ...... 60 3.9 Pascal Code Segment ...... 62 3.10 A Simple VP ...... 65

4.1 OpenCL Functions ...... 74 4.2 Scalar C Function ...... 74 4.3 Parallel OpenCL Kernel ...... 74 4.4 OpenMP Program Execution ...... 79 4.5 Sieve Function Prototype ...... 81 4.6 Sieve Block ...... 81 4.7 SieveC++ Implementation ...... 82 4.8 Sieve C++ Code ...... 83 4.9 Offload Function Prototype ...... 85 4.10 Offload C++ Code ...... 86 4.11 C++ Class ...... 93 4.12 C++ Code Segment Using TBB libraries ...... 93

5.1 Splitting a VSM Register on 4 SPEs ...... 98 5.2 Message Formats ...... 103


5.3 Broadcasting PPE Messages to SPEs ...... 105 5.4 SPE Creation ...... 106 5.5 Launching SPE Thread Using POSIX Threads ...... 106 5.6 Barrier Synchronisation Routine ...... 107 5.7 Store Virtual SIMD Instruction ...... 109 5.8 Load Virtual SIMD Instruction ...... 110 5.9 SPE Interpreter Structure ...... 111 5.10 Messaging Pulling Code Segment ...... 112 5.11 Loading Unaligned Data to SPE's Local Memory ...... 114 5.12 Splitting an SPE Block into 3 DMAs for Storing Process ...... 116 5.13 Synchronise Shared-Block ...... 118 5.14 Storing Middle Block ...... 119 5.15 Acknowledgment of an Operation Completion ...... 119 5.16 Add Operation ...... 121 5.17 Replicate Operation ...... 122 5.18 Pattern Of Reduction Operation ...... 122 5.19 Vector Add Reduction Operation ...... 124 5.20 Building VSM Object Files ...... 124 5.21 A C Program Uses VSM as API ...... 126 5.22 C Simulator ...... 127 5.23 (a) shows the latency per thread launching while (b) approximates the percentages of time spent in the activities involved in setting up a single SPE thread ...... 128 5.24 The Latency of Sending Messages to SPEs ...... 130 5.25 The average latency time for processing a 32-bit word (single precision floating-point) using a virtual register of size 4 KB per SPE ...... 132 5.26 The time for processing a block of data on 1, 2 and 4 SPEs. Each operation was measured using a block of 4096 single precision floating-point values, a virtual register of size 4 KB per SPE and was run 10^6 times ...... 133 5.27 (a) The percentage of time spent on the main activities out of the total latency time of each operation (b) The percentage of time spent on processing aligned data as compared to unaligned data.

6.1 Building the code generator and compiling a Pascal program ...... 153

7.1 Branch Instructions Patterns in ILCG Using Gas Assembly ...... 158 7.2 Assemble Code ...... 159 7.3 Optimal Assembly Code ...... 160 7.4 One-point Crossover ...... 164


7.5 The performance of a genetic algorithm on optimizing a PowerPC instruction set ordering ...... 174 7.6 Register-to-Register Operation ...... 175 7.7 Register-to-Immediate Operation ...... 176 7.8 The results gained by optimizing a previously manually optimised code generator for the Pentium using a genetic algorithm. The average fitness values reflect the performances of the code generators on the three selected applications ...... 177 7.9 Normalized performance gained by optimizing instruction set ordering of PowerPC and Pentium 4 architecture using a genetic algorithm ...... 177

8.1 Simple Program in VP to Test a BLAS Kernel ...... 183 8.2 Generated Code for Expression x := y + z ...... 185 8.3 (a) Performance PPE vs. SPE ...... 186 8.4 Performance and of SPEs ...... 187 8.5 C code to compute the dot product of two vectors sequentially on the PPE and in parallel using the Cell's SPEs ...... 188 8.6 Performance of C code on the PPE and the SPEs by Using VSM as an API ...... 190 8.7 SPE Scalability On BLAS Kernels Using C Code ...... 191 8.8 N-body Problem Pseudocode ...... 192 8.9 Performance of One SPE vs. the PPE (N-body Problem) ...... 194 8.10 Speedups of Vector Pascal on the Cell's processors (N-body Problem) ...... 195 8.11 The SPEs Performance ...... 196 8.12 SPE Scalability ...... 197 8.13 Vector Pascal and C Performance ...... 199 8.14 Performance of SPEs versus PPE on Blurring Program ...... 204 8.15 SPEs Scalability on Blurring 4096x4096 Image Using Different VSM Register Sizes ...... 205

List of Tables

2.1 Flynn’s taxonomy ...... 14

5.1 VSM Opcodes ...... 101 5.2 The average latency of the basic thread management operations on the Cell accelerators ...... 127 5.3 The average time of processing a block of 4096 single precision floating-point (32-bit) elements using a virtual register of size 4 KB per SPE. Each operation was run 10^6 times ...... 131

6.1 Emulating ENTER instruction on PowerPC ...... 144

7.1 Genetic Algorithm Improvements on the PowerPC. The first set of applications were training programs and the last two programs were not in the training set ...... 176

8.1 Machine Configuration ...... 179 8.2 Basic Linear Algebra Kernels ...... 184 8.3 Performance of Vector Pascal vs. C...... 199 8.4 Performance of the PPE and SPEs Using Different VSM Register Sizes . 204

List of Publications

1. Gdura, Y. and Cockshott, P., A Virtual SIMD Machine Approach for Abstracting Heterogeneous Multi-core Processors, Journal of Computing (GSTF), vol. 1, number 4, pp. 143-148, 2012.

2. Gdura, Y. and Cockshott, P., A Compiler Extension for Parallelizing Arrays Automatically on the Cell Heterogeneous Processor, in Proceedings of the 16th International Workshop on Compilers for Parallel Computing (CPC 2012), Italy, January 2012.

3. Cockshott, P., Gdura, Y. and Keir, P., Two Alternative Implementations of Automatic Parallelisation, in Proceedings of the 16th International Workshop on Compilers for Parallel Computing (CPC 2012), January 2012.

4. Cockshott, P. and Gdura, Y., Code Generator Optimizer Using Genetics Algorithm Techniques, 1st Asia-Pacific Programming Languages and Compilers Workshop (APPLC), 2012. (Accepted)

5. Cockshott, P., Gdura, Y. and Keir, P., Array Languages and the N-body problem, Concurrency and Computation: Practice and Experience, CPE-12-0022. (Submitted)

1 Introduction

The notions of concurrency and parallel computing have existed from the earliest development of computers, and since then the promises of parallelism have fascinated many researchers [1, 2, 3, 4, 5]. Parallel computing is a fundamental research topic in computer science and grew as a topic of interest in the mid 1980s upon the introduction of massively parallel processors and networks of computers [1, 3, 6, 7, 8, 9, 10, 11]. The interest in this area has also been stirred up in the last two decades by the advent of Single Instruction Multiple Data (SIMD) technology in the 1990s and later by the arrival of multi-core platforms in the mainstream industry, such as multi-core general purpose architectures (CPUs) and Graphics Processing Units (GPUs) [12, 7, 13, 14, 15]. These modern architectures offer high performance hardware at reasonable cost, which has made them widely used in many application areas such as image processing, graphics, multimedia, modelling and scientific computation [16, 17, 7, 18, 9]. This widespread industry adoption of multi-core architectures has a significant influence on mainstream software and application development, as it requires proper, simple to use and up to date tools for developing parallel and concurrent programs.

This chapter starts with a short introduction to key issues in parallel computing and to the targeted processor and language. It then presents the thesis statement, motivations and contributions, and concludes with a summary of what will be covered in each chapter.

The key issues in parallel programming are: identifying available parallelism, partitioning data, managing data transfers between cores, handling the required communication among cores, synchronising cores and performing any required data alignments [19, 20, 6, 21, 22, 23]. This has turned out to be a very complicated task [3, 24, 25, 10, 8, 9], and software researchers have, in the last two decades, put additional effort into developing various programming paradigms that simplify the parallelisation process. Yet success in developing simple and fully implicit parallel models has been limited due to the complex analysis that compilers are required to do [26, 23, 17, 27]. In recent years, various tools such as programming languages, extended compilers and libraries have been developed to simplify parallel programming on multicore architectures [19, 18, 28, 29, 30, 31, 32, 33, 34, 35]. The simplicity of a given model is often measured by its capability to handle the parallelisation process and, most importantly, how it identifies available parallelism.


The three common approaches to exploiting parallelism in a given application are explicit, semi-implicit and implicit techniques. Fully explicit models require substantial details, such as identifying parallelism, message passing, data movement, alignment and synchronisation, to be provided manually by programmers. In contrast, semi-implicit parallel programming models liberate programmers from several parallelisation tasks such as thread creation, communication, data transfers, synchronisation and alignment, but they still depend on programmers to identify parallel regions of code using annotations, preprocessor directives and constructs [25, 9, 10]. Developers have in recent years had good success in applying this technique in various programming models for parallel architectures. OpenMP is nowadays the most commonly used semi-implicit programming model for multi-core architectures [30, 8, 36]. Hybrid Multi-core Parallel Programming (OpenHMPP) is also a directive-based model that has recently been introduced for developing parallel programs on CPUs and GPUs [37]. Software developers have also introduced a number of compiler extensions, such as SieveC++ and OffloadC++, and libraries, such as Intel Threading Building Blocks (TBB) [38, 18, 20].
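As a concrete illustration of the semi-implicit style, the short C sketch below marks a loop with an OpenMP directive; the array names and size are arbitrary. The programmer only identifies the parallel region, while the compiler and runtime handle thread creation, work partitioning and the implicit barrier at the end of the loop.

#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void)
{
    /* Semi-implicit parallelism: the directive marks the loop as parallel;
       OpenMP creates the threads, splits the iteration space among them and
       joins them when the loop completes. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}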

There are also programming languages that have been built to support parallelism without relying upon annotation and directive mechanisms. High Performance Fortran (HPF) and Matlab [19, 39], for example, offer constructs for expressing parallelism in an implicit manner. The Z programming language (ZPL), Vector Pascal (VP) and Fortran represent another programming paradigm that uses array abstraction to implement data parallel programming [40, 41, 42]. Functional programming languages, such as Single Assignment C (SaC), Glasgow Parallel Haskell (GpH) and its extension Data Parallel Haskell (DPH), are mutation-free programming models that also provide support for concurrency mechanisms [43, 34, 44]. Chapel, Fortress and NESL are also parallel programming languages, but they have been developed for supercomputers such as the Thinking Machines CM5 and the Cray C90 [45, 46, 33, 47]. Virtual machines, such as CellVM and Hera-JVM, have been developed as experimental models to manage multithreaded applications on parallel architectures, in particular the Cell processor [5, 48, 49]. Most of the programming models mentioned so far are for programming general purpose processors. CUDA and OpenCL are new parallel languages that have recently emerged for programming general purpose applications on specialised processors such as GPUs and digital signal processors (DSPs) [50, 51, 52, 53].

Nevertheless, explicit and semi-implicit models are often unattractive for parallelising existing sequential applications because of the alterations and adjustments they require, and they sometimes do not guarantee improved performance. The third technique for exploiting parallelism is known as the implicit or automatic technique. Implicit models require no programmer intervention, as the whole parallelisation process is left to their compilers. Yet it is an open question whether it is possible to develop a parallelising

compiler that discovers all the parallelism available in an application. The short answer is probably that auto-parallelising compilers often do not succeed in delivering full parallelisation, because finding parallelism usually requires sophisticated compiler analysis and code mapping [3, 23, 6, 24, 25, 11].

Heterogeneous architectures make it even harder for compilers to generate efficient parallel code than homogeneous machines do. Heterogeneous multi-core architectures have different types of processing cores, each designed to support and carry out a different set of functions; the Cell is one such heterogeneous architecture. The Cell processor, which is targeted in this work, is a single multi-core chip that consists of a general Power Processing Element (PPE) and 8 Synergistic Processing Elements (SPEs) [54, 55, 56]. The Cell processor potentially offers high levels of parallelism, but it is not easy to program due to its heterogeneity of CPUs, memory structures and instruction sets. The PPE is responsible for overall control of the processor, while the SPEs are mainly designed for computation. The Cell is a distributed-shared memory architecture with main memory attached to the PPE and private Local Storage (LS) on each SPE. An SPE's LS can be accessed directly by that SPE, while the PPE and the other SPEs can reach it only through DMA controllers in a non-coherent mode [56, 57, 54]. The other heterogeneity of the Cell appears at the instruction level, as the PPE and SPEs have different instruction sets. This feature presents a major challenge for developing Cell applications because it requires writing source code for each core type and the use of two compilers. The Cell architecture has shown performance potential, especially when running code that is well adapted to its heterogeneity [56].
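For orientation, the sketch below shows the general shape of SPE-side code that pulls a block from main memory into the local store by DMA, processes it and writes it back. It assumes the Cell SDK's spu_mfcio.h primitives (mfc_get, mfc_put and the tag-status calls); the buffer size, effective address and trivial computation are illustrative only and are not the VSM implementation described later in this thesis.

/* SPE-side sketch (built with spu-gcc).  Assumes the Cell SDK's
   spu_mfcio.h DMA interface; sizes and addresses are illustrative. */
#include <spu_mfcio.h>

#define BLOCK 4096   /* bytes; DMA sizes must be multiples of 16 */

static float ls_buf[BLOCK / sizeof(float)] __attribute__((aligned(128)));

void process_block(unsigned long long ea)   /* ea = effective address in main memory */
{
    unsigned int tag = 0;

    mfc_get(ls_buf, ea, BLOCK, tag, 0, 0);   /* main memory -> local store */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();               /* wait for the DMA to finish */

    for (unsigned int i = 0; i < BLOCK / sizeof(float); i++)
        ls_buf[i] += 1.0f;                   /* trivial stand-in computation */

    mfc_put(ls_buf, ea, BLOCK, tag, 0, 0);   /* local store -> main memory */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}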

This work therefore focuses on developing a programming tool that can aid high-level programming language compilers to parallelise array operations automatically, and Glasgow Vector Pascal (VP) was chosen for this project. It is an extension of the standard Pascal programming language, and it supports SIMD instruction set extensions and data parallel operations [58, 42, 7]. Vector Pascal is a suitable choice of language because it supports array syntax and array operations. Operating on entire arrays rather than loops with explicit indexing provides opportunities for compilers to generate parallel code. Yet the most important feature of this language is that its front-end compiler already supports SIMD technology and flexible degrees of parallelism, and hence its code generator can easily be switched to produce SIMD-like code that operates on large registers.
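To make the last point concrete, a whole-array assignment such as a := b + c denotes one operation over every element. The C sketch below (array names and length are arbitrary) shows the element-wise loop such an expression is equivalent to: every iteration is independent, so a compiler can divide the index range among SIMD lanes or accelerator cores without any dependence analysis.

/* Element-wise equivalent of the array expression  a := b + c.
   Each iteration is independent, so the range 0..n-1 can be split
   across SIMD lanes or accelerator cores in any order. */
void array_add(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}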

1.1 Thesis Statement

Heterogeneous multi-core architectures are still of limited use due to the difficulties associated with programming them and managing their heterogeneity. The heterogeneous technology has

introduced new challenges for compiler developers: how to make parallel programming easier for ordinary programmers and how to make the most of the existing parallel platforms. I assert that a new virtual machine approach can be employed to reduce the burden of developing applications for modern general purpose heterogeneous processors while enabling automatic parallelisation of computations on large data sets. The work shall be demonstrated by designing and implementing a Virtual SIMD Machine (VSM) model that hides the intricate details of the Cell heterogeneous architecture completely and allows array operations to be parallelised on its accelerators. The work shall also investigate whether an array programming compiler, such as the Glasgow Vector Pascal (VP) compiler, can be extended to use the model to exploit parallelism implicitly and attain sufficient performance. The VSM and the extended VP compiler are expected to work together as one compiler system that takes a single source code and examines the code for conventional array expressions. If an array expression includes arrays of adequate size for parallelisation, the compiler then delegates the evaluation of the array expression to the Cell accelerators; otherwise the evaluation is performed on the master processor.
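The delegation decision can be pictured as a simple size test at the point where an array expression is evaluated. The sketch below is only an illustration of that idea: the threshold value and the offload function are hypothetical and do not reflect the actual VP-Cell implementation described later.

/* Hypothetical illustration of the dispatch rule: sufficiently large array
   expressions are offloaded to the accelerators, small ones are evaluated
   on the master processor.  Names and the threshold are made up. */
#define PARALLEL_THRESHOLD 1024   /* illustrative cut-off, in elements */

void add_on_accelerators(float *a, const float *b, const float *c, int n);  /* hypothetical offload path */

void add_arrays(float *a, const float *b, const float *c, int n)
{
    if (n >= PARALLEL_THRESHOLD)
        add_on_accelerators(a, b, c, n);   /* delegate to the accelerators */
    else
        for (int i = 0; i < n; i++)        /* evaluate on the master core */
            a[i] = b[i] + c[i];
}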

The ambitions are to reduce the complexity associated with the fully automatic parallelisation approach by focusing only on array expressions, and also to ease the task of developing parallel applications by letting programmers concentrate on algorithms rather than on parallelisation issues such as communication, partitioning, alignment and synchronisation.

1.2 Motivations

It is very obvious that manually developing parallel programs is time consuming, error prone and costly; moreover, parallelising existing applications manually usually requires significant transformation of code [25, 9, 5, 34, 38, 47, 2, 59]. In the last two decades, more effort has been put into simplifying the difficult task of explicitly managing parallelism on multi-core architectures, and consequently the quality of parallelising tools has improved since the advent of semi-implicit parallel programming models [30, 60, 20, 50, 52]. Semi-implicit models have been considered effective parallelisation tools that save time and effort, yet they still require programmers' involvement to explicitly identify or express parallel components and to verify whether parallelisation is worthwhile. The verification step is essential in order to guarantee a performance improvement; otherwise parallelisation may result in performance degradation. The identification of available parallelism is also considerably important. Therefore, an attractive programming approach is to have a parallelising compiler that implicitly spots parallel

computations in sequential applications, but this would also be a very complicated task for a compiler attempting to discover all the parallelism available in an application.

In the light of these considerations, this work proposes a programming tool that hides all the underlying details of the Cell processor and assists existing compilers by focusing on parallelising array expressions on the Cell accelerators. Arrays are a good target for several reasons. First, arrays are a good candidate for data parallelism because operations on arrays are inherently parallel and can easily be applied to various sections of an array in different ways. Secondly, evaluating one array expression at a time minimises the memory requirement as compared to evaluating a compound statement or using block offloading techniques. The proposed technique focuses on evaluating an individual array expression, which requires relatively little space and so suits architectures with a limited amount of local storage well. Arrays also provide opportunities for compilers to generate parallel code directly, without the need for a complex analysis process to search for available parallelism or for programmers' intervention to identify parallel regions of code.
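The memory argument can be made concrete: evaluating a single array expression streams through its operands in fixed-size pieces, so only a small buffer per operand ever needs to fit in an accelerator's local store. The sketch below (chunk size and names are illustrative, and the inner loop stands in for work that a real system would hand to an accelerator) shows the shape of such chunked evaluation.

/* Evaluate  a := b + c  one fixed-size chunk at a time.  Only CHUNK
   elements of each operand are needed at once, which suits accelerators
   with small local stores.  Illustrative sketch only. */
#define CHUNK 1024   /* illustrative chunk size, in elements */

void chunked_add(float *a, const float *b, const float *c, int n)
{
    for (int base = 0; base < n; base += CHUNK) {
        int len = (n - base < CHUNK) ? (n - base) : CHUNK;

        /* In a real system this chunk would be shipped to an accelerator;
           here the work is done in place to keep the sketch self-contained. */
        for (int i = 0; i < len; i++)
            a[base + i] = b[base + i] + c[base + i];
    }
}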

1.3 Objectives

• To provide a fully implicit programming model for parallelising array operations and supporting scalable parallelisation on heterogeneous architectures such as the Cell architecture.

• To completely hide the underlying details of heterogeneous architectures, where the accelerators behave as coprocessors.

• To introduce a framework that can be used as an abstract model to shorten the time for developing parallelising compilers for heterogeneous architectures.

• To simplify the analysis that a compiler is required to perform to identify available parallelism.

• To reduce the burden involved in developing parallel programs for general purpose heterogeneous processors such as the Cell processor.

• To enable programmers to focus on using the proper parallel algorithms rather than on parallelising code.

• To reduce the execution time of the automatically parallelised programs as compared to the execution time of the sequential ones.

1.4 Contributions

The following contributions have been made by this work:

• The VSM approach for abstracting heterogeneous multi-core processors:

– Presenting a new parallelisation approach which imitates the SIMD concept to increase the work a single instruction performs.

– Developing a fully implicit programming model for parallelising array operations and supporting scalable parallelisation.

– Demonstrating the possibility of completely hiding all the underlying details of a heterogeneous multi-core architecture such as the Cell processor.

– Introducing a framework that can be used by existing compilers to parallelise array expressions on heterogeneous multi-core architectures.

– Developing a VSM interface as an abstract model to shorten the time for developing parallelising compilers.

– Presenting a messaging protocol that can be used to exchange information between different processing core types.

– Developing alignment and synchronisation algorithms to handle alignment constraints, maintain data consistency and avoid race conditions when a machine's accelerators access main memory.

– Completely hiding all low-level details of memory management, such as data alignment, the transfer of messages between different cores and data transfers between core memory and devices.

– Presenting a fully implicit model that can be used as an API by high-level programming languages to perform array operations in parallel on a heterogeneous architecture using a single-source compiler.

• The PowerPC Back-end Compiler

– Developing a back-end compiler to port Glasgow VP to the Cell's master processor.


– Developing a new approach to optimise the code generator of a compiler. The novelty of this approach is to apply a genetic algorithm to automatically optimise machine instruction ordering. This optimiser was not part of the objectives, but it was developed along the way to optimise the VP back-end compiler for the Cell's master processor.

– Extending the PowerPC back-end compiler to work with the VSM model as a single compiler system.

1.5 Outline

Chapter 2 presents background information on parallel computing. It starts with an overview of parallel processors and parallel memory architectures. It then discusses different levels of parallelism and the SIMD technology. After that, it looks at parallel programming techniques and existing array parallel programming models such as the A Programming Language (APL), the Z programming language (ZPL), Single Assignment C (SaC) and Fortran.

Chapter 3 is devoted to delineating the targeted processor and programming language. It begins with background information on PowerPC architectures and machine-dependent features such as stack frames and function calling conventions. It then describes the Cell's heterogeneous cores, the main components of the Cell's accelerators, and the communication mechanisms that the Cell supports. This chapter shall also introduce the main library functions that are provided in the manufacturer's Software Development Kit (SDK); the aim is to discuss only the SDK library functions that shall be used in this work, such as the functions for exchanging messages and creating threads. After that, it introduces the Glasgow Vector Pascal programming language and its features. The discussion includes more details on how to build VP back-end compilers and the required tools. It also discusses some issues that are closely related to this work, such as array boundary checking and vectorising array operations.

Chapter 4 introduces a number of programming paradigms that have recently been developed for programming general purpose applications on heterogeneous shared-memory architectures. It first sheds some light on the two parallel programming models, CUDA and OpenCL, which have been developed lately to make use of GPUs for general purpose computing. After that, it presents the parallel programming models that have been introduced recently for programming the Cell processor. Two of these models have been

introduced as commercial tools, namely OpenMP and Offload C++; OpenMP has already been in use on other architectures. The chapter shall also introduce two other models, CellVM and Hera-JVM, which have been designed specifically for the Cell processor.

Chapter 5 describes the Virtual SIMD Machine (VSM) interface. It starts with a description of the VSM registers. It then describes the type and format of the virtual machine instructions and presents samples of different instruction implementations. It next explains the developed messaging protocol from both the design and implementation perspectives. After that, it discusses in detail the VSM's two co-operative interpreters. The discussion introduces two new algorithms that were designed and implemented to handle the alignment and synchronisation of the VSM load and store instructions.

Chapter 6 discusses the conventional VP back-end compiler and the extended version for the Cell processor. It first describes the PowerPC machine specification of the conventional compiler and gives an example to illustrate how instruction order can affect generated code. It then describes the machine-dependent routines and the two stack frame operations, ENTER and LEAVE. After that, it describes and discusses the extended version of the compiler, which allows it to parallelise array expressions automatically. The discussion covers the virtual SIMD register set, the new VSM instruction set and any adjustments to the machine-dependent routines just mentioned. The chapter concludes with a description of how to use the extended compiler and the VSM model as one compiler system.

Chapter 7 discusses in detail a code generator optimiser. The optimiser is based on genetic algorithm techniques to automatically optimise machine instruction ordering. It starts with an introduction to the problem and a description of the basic algorithm. It then looks at key design aspects and the implementation. After that, it presents the experimental results that show the genetic algorithm's improvements and the quality of automatically constructed code generators.

Chapter 8 presents the experimental results, which show that the VP-Cell system achieved significant runtime performance on the Cell's accelerators and also considerable performance when the VSM model was used to parallelise C code over the Cell's accelerators. This chapter first discusses the observed performance of the VSM model on basic array operations using BLAS micro-benchmarks written in Vector Pascal. It then provides a detailed analysis of the performance of the VP-Cell compiler based on real-world

benchmarks. It also presents experimental results obtained using the VSM model as an application programming interface for parallelising a number of C linear algebra kernels.

Chapter 9 presents the conclusion and explores opportunities for future work that builds upon the work presented in this dissertation.

2 Background

This chapter starts with an overview of parallel processor and parallel memory architectures and then briefly discusses the principles of parallel programming. It then looks at a number of parallel programming paradigms. The discussion focuses most directly on array programming languages, since the proposed VSM model was initially designed to back up array programming compilers in parallelising array expressions. After that, it looks briefly at genetic algorithms and how they have previously been used in optimising generated code.

2.1 Overview of Parallel Processors

The interest in parallel hardware dates back to the beginning of the 1960s, when Slotnick proposed the design of the first vector processing machine [61]; his design was implemented a decade later by the University of Illinois to produce the ILLIAC IV system [61]. This machine was organised in four quadrants. Each quadrant had its own control unit and contained 64 processors, and each processor had 6 registers, 2K words of local memory and operated on 64-bit words [61]. The quadrants were designed as individual units so that they could be used separately by applications that did not need the computation power of the entire machine, and eventually only one quadrant was built. The ILLIAC IV was one of the first machines to support Single Instruction Multiple Data (SIMD) technology, which basically offers the means to perform the same operation on multiple data items concurrently [3].

Processors that offer support for SIMD technology are considered a class of parallel architectures [7, 62]. By the time the ILLIAC IV was ready to be used publicly, it had been overtaken by the introduction of the first Cray computer (Cray-1) in 1976 [62]. The Cray-1 was designed with pipelined vector arithmetic units to operate on large data sets, and its CPU's sustained performance reached 138 MFLOPS [3]. A second SIMD computer was introduced in 1979: the ICL Distributed Array Processor (DAP), which had 64x64 processor elements, each processing a single bit at a time. Unlike the ILLIAC IV, the ICL DAP was a commercial product, and one of its users was the University of Edinburgh, which used

it for several science applications [3]. In 1982, the Cray X-MP was released. It was a shared memory parallel vector architecture with two central processing units and had a theoretical peak performance of 400 MFLOPS [63]. During that period two Massively Parallel Processor (MPP) supercomputers were introduced [64, 3]. The Goodyear MPP, which was produced in 1983, was one of the first MPP systems. The second system was the Connection Machine (CM-1), which was also introduced in the early 1980s by the Thinking Machines Corporation [65]. The CM-1 consisted of 64K bit-serial processors, and its design also supported SIMD instruction sets. In the late 1980s, Cray Research started the production of the Cray Y-MP supercomputer series [66]. The first machine in this series was known as the Y-MP Model. This model was designed to have 2, 4 or 8 vector processors with multiple computation units each. The Y-MP production line continued until the 1990s. Vector processing and massively parallel processor supercomputers are used for solving complex problems, but they are very costly and require special skills to program. These limitations, together with the demand for midrange computing machines, led to the development of minicomputers, which progressively developed into microcomputers or single-user machines.

The 12-bit Programmed Data Processor (PDP) computer was one of the first minicomputers. It was produced by Digital Equipment Corporation in 1965 with practical capabilities and a reasonable cost for small groups. Minicomputers as stand-alone machines were not designed as parallel computers, yet two or more minicomputers were commonly used to build a system that worked as a parallel computer. As minicomputers developed in the 1970s and 80s, a new small class of computing machines emerged in the 1980s with the advent of microcomputers or single-user machines (aka Personal Computers, PCs) [3, 4]. Single-user computers with a single core processor have become very popular since the 1990s as their performance consistently advanced and their costs were relatively affordable. The performance of single core processors kept advancing during the 80s and 90s through enhanced resources, such as clock speed, memory size and storage devices, and through adoption of the SIMD technology [4, 67, 7, 68]. The power of conventional single core machines, however, had by the late 1990s run into physical limitations such as speed, heat dissipation and power consumption. According to Moore's Law, which describes the long-term expectation of hardware development, computer performance was anticipated to double every 18 months as a result of the growth in the number of transistors that can be placed in a given area. Yet in recent years the clock frequency and other resources of single-core architectures have reached a plateau [68, 69]. These limitations, as well as the demand for high performance mainstream machines to keep up with the power needed by applications used on a daily basis, shifted designers toward multiple core solutions: special purpose computing machines that integrate specialised processing elements such as Field Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs) and, most commonly,

Graphics Processing Units (GPUs), and general purpose computing machines with multiple core versions of conventional CPUs [70, 71, 59, 72].

The fast growth of the game and graphics market, along with the physical limitations of the single core technology, led to the advent of GPUs in the mainstream industry. GPUs are highly parallel architectures that are traditionally designed for computer graphics and dedicated to calculating floating point operations. The first GPU with hardware acceleration was released in 1999 by Nvidia and was called the GeForce 256 [70]; it was the first fully 3D accelerator. Modern GPUs have also been improved by allowing software developers to use them for non-graphics computations, but they are relatively expensive and hard to program. These burdens, plus the necessity for high performance machines for general purpose computing applications, have also led to the introduction of general purpose multi-core architectures.

However, the mainstream industry has recently been focusing on the development of processors with a relatively small number of cores that theoretically are expected to offer satisfactory performance for general purpose computing. Most modern multicore chips are symmetric platforms, such as the AMD Phenom II with 6 cores and the Intel Nehalem with 4 cores, both released in 2008 [9, 73]. The alternative is heterogeneous multi-core chips, such as the 9-core IBM Cell processor, which was produced in 2007 [54, 56, 74]. Heterogeneous multicore architectures often have specialised cores that are designed for specific functionality, and they are expected to provide better performance as compared to symmetric platforms [68, 9]. The VSM model was implemented to parallelise array operations on the Cell heterogeneous processor.

2.2 Parallel Memory Architectures

Parallel memory architectures can be distributed-memory, shared-memory or a mix of both. A distributed-memory system, also called a private-memory system, consists of a number of processors connected via a general interconnection network; each processor has its own private memory and can operate independently [5, 4, 75]. This type of structure does not require cache coherence protocols, because updating a location in one processor's memory will not affect the data in the memories of the other processors. One of the advantages of a distributed-memory system is that it makes it possible to build a large-scale network of heterogeneous machines for high performance computing. Another advantage is that each processor has very fast access to its local memory; this feature can also be found in Symmetric Multiprocessing (SMP) platforms. The third advantage is that distributed-memory systems are scalable in terms of memory size,

because as the number of processors increases, the total size of memory increases [76]. On the other hand, distributed-memory platforms are not appropriate for applications that are based on global data structures, and they require programmers to explicitly define how and when data is transferred between processors. Distributed-memory systems also require a message-passing protocol for communication between processors [31, 8], and one of the most widely used parallel programming models for managing distributed-memory systems is the Message Passing Interface (MPI).
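For illustration, the minimal MPI fragment below (plain C; the transferred value is arbitrary) shows the explicit send/receive style that distributed-memory systems require, in contrast to the directive-based shared-memory style shown earlier. It assumes the program is launched with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;                        /* data lives in rank 0's private memory */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* explicit transfer between two private memories */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}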

A shared-memory system refers to a global address space that can be simultaneously accessed by multiple processing elements [8]. This form of architecture is often a Uniform Memory Access (UMA) system that makes a global physical memory equally accessible to all processing elements [76, 75]. The communication between processors in a shared-memory system is maintained by memory coherence protocols, which notify the processors involved of changes to shared memory resources. This technology offers a fast medium of communication between multiple processors, and it is relatively easy to program because there is only one source of data. The most popular parallel programming model for shared-memory platforms is OpenMP. I shall elaborate more on OpenMP in the next chapter.

Comparing the two parallel memory systems, software problems are more complex in distributed-memory systems than in shared-memory systems, while hardware problems are easier in distributed-memory systems [75]. The two systems can, however, be combined in what are known as Distributed Shared Memory (DSM) systems. A DSM system generally involves a number of nodes connected by an interconnection medium, and each node, in addition to its own local memory, has access to a shared memory [75]. DSM systems can be a set of clusters connected by an interconnection network, or can be built on a smaller scale, as in the Cell processor. On the Cell processor, each accelerator has its own local memory and shares the system memory with the other accelerators.

2.3 Parallel Computing

Parallel computing is a fundamental research topic in computer science and grew as a topic of interest in the mid 1980s upon the introduction of massively parallel processors and networks of computers [1, 3, 6, 77, 78, 8, 9, 10, 11]. The interest in this area has also been stirred up in the last two decades by the advent of the Single Instruction Multiple Data (SIMD) technology in the 1990s and later by the merging of multi-core platforms, such as multi-core general purpose architectures (CPUs) and special purpose platforms like Graphics Processing Units (GPUs), into the mainstream industry [12, 7, 14]. These


modern technologies offer high performance platforms with reasonable cost, which has made them widely used in many application areas such as image processing, graphics, multimedia, modelling and scientific computation. However, this widespread industry adoption of multi-core architectures has had a significant influence on mainstream software and application development and has introduced new challenges for software developers to provide proper, simple to use and up to date tools for developing parallel and concurrent programs. The main challenges in parallel programming are: managing data transfers between cores, handling the required communication among cores, synchronisation between cores, performing any required data alignments and exploring the parallelism available in applications [19, 20, 6, 21, 17, 79, 38, 22, 23]. The following sections shall discuss two of the key problems of parallel programming, the SIMD technology and alignment issues, and then look at the key issues in identifying parallelism, such as levels of parallelism, types of parallelism and parallelisation techniques.

                       Single Data   Multiple Data
 Single Instruction    SISD          SIMD
 Multiple Instruction  MISD          MIMD

Table 2.1: Flynn's taxonomy

2.3.1 SIMD technology

In 1966, Michael Flynn proposed what is known as Flynn's taxonomy for categorising computer machines [80]. Flynn's categorisation is based on the data streams that can be manipulated by instructions or control units, and it is divided into four classes, as shown in Table 2.1. The discussion in this section shall focus only on the SIMD class, since the VSM parallelisation model emulates the conventional SIMD concept.

SIMD technology provides suitable techniques for applications that are inherently parallelisable, such as media processing, 3D graphics and games. It extends a machine's instruction set to allow operations on multiple data items concurrently and consequently offers performance improvements [7, 81]. The technology has been in use since the 1970s in supercomputers such as the CM-1 and the ICL DAP [82, 65]. The CM-1 and CM-2 machines were capable of carrying out 64K 1-bit arithmetic operations simultaneously [65]. The ICL DAP was also a SIMD architecture, with the capability to perform 4096 arithmetic operations at a time [82].

This technology has been through several developments, especially since it began to be used in mainstream computers. Sun Microsystems adopted it in the mid 1990s by introducing SIMD integer instructions in its UltraSPARC I machine. Around the same

time, Intel introduced the MMX instruction set with its fifth generation of Pentium architectures [83]. MMX provided integer instructions that operated on the existing floating point registers. Yet sharing one set of registers caused a problem, as processing units could not work on both floating point data and SIMD integer data at the same time [7, 14]. In 1998 Advanced Micro Devices (AMD) also released its 3DNow! instruction set to support SIMD technology on the AMD K6-2 architectures [84]. The first implementation of 3DNow! supported only SIMD floating-point operations [84]. One year later, Intel introduced with the Pentium III processor another instruction set called Streaming SIMD Extensions (SSE). The SSE extension solved the problem that MMX had by using independent registers (called XMM). Further advancement was provided by the SSE extension series SSE2, SSE3 and SSE4, which support computation in parallel on 128-bit integer and floating point data [14]. Apple, IBM and Motorola also developed their own SIMD extension (AltiVec) to the Power architectures [85, 14, 7]. AltiVec is a 128-bit extension and has its own register file. AltiVec architectures do not use the scalar integer and floating point units; instead they use a 128-bit vector processing unit [85]. Nowadays most machines provide 128-bit SIMD instruction sets, yet Intel introduced in 2011 its Advanced Vector Extensions (AVX) with the Sandy Bridge processor family [86]. AVX is a 256-bit extension of Intel's previous SSE [81, 86].

SIMD instructions allow multiple data items to be processed in a single operation [7]. The scale of performance that SIMD machines can provide is determined by the length of the machine registers and the size of the data type to be processed. For example, the level of parallelism that the Intel SSE series and IBM AltiVec can support ranges between two and sixteen words at a time, depending on the data type. The number of words processed in a single step, however, can be doubled by using the Intel Advanced Vector Extensions (AVX), introduced in 2011 as part of the Sandy Bridge processor family, which operate on 256 bits instead of 128 bits [81]. These wider vectors improve performance as more data elements can be processed in a single operation.
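As a small example of register-width parallelism, the C fragment below uses the SSE intrinsics to add four single-precision values per instruction; it assumes a compiler and CPU with SSE support, a length n that is a multiple of 4 and 16-byte-aligned pointers. With 256-bit AVX registers the same pattern would process eight floats per step.

#include <xmmintrin.h>   /* SSE intrinsics */

/* Add two float arrays four elements at a time using 128-bit SSE registers.
   Assumes n is a multiple of 4 and the pointers are 16-byte aligned. */
void sse_add(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 vb = _mm_load_ps(&b[i]);            /* load 4 floats */
        __m128 vc = _mm_load_ps(&c[i]);
        _mm_store_ps(&a[i], _mm_add_ps(vb, vc));   /* 4 additions in one instruction */
    }
}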

A number of software tools, such as compilers and library functions, have been developed to make use of the parallelism offered by SIMD technology [12, 7, 14]. In 1997, Stanford University introduced a C parallelising compiler for the Sun UltraSPARC architecture using SIMD extensions [7, 87]. Intel also developed a C++ compiler that provides a set of built-in functions to support MMX and the SSE series. Codeplay's VectorC compiler was likewise designed to produce vector code for SIMD machines. The VP compiler, which was developed at Glasgow University, is also designed to support SIMD instruction sets such as MMX and 3DNow! [58, 7, 35]. In this project, the SIMD technology is imitated using virtual SIMD instructions that operate on large vector (virtual) registers.


2.3.1.1 The Data Alignment Problem

Data alignment concerns the way data is laid out and accessed in memory, and the rules that control it differ from one machine to another. With their native instruction sets, most architectures handle load and store instructions to unaligned data internally. For example, on most RISC machines any attempt to access a memory location that is not aligned will generate an alignment fault [88, 89]; this fault is usually resolved internally by the operating system using loads or stores to overcome the problem [88]. Other machines, like x86 architectures, do not require aligned memory accesses, while some processors, such as MIPS, have special unaligned memory access instructions [88].

However, in order to exploit the best performance of SIMD architectures, SIMD memory accesses must be aligned, because accessing unaligned data can considerably affect the efficiency of SIMD vectorisation [7, 90, 14]. Most modern architectures support SIMD extensions which require data to be aligned [7, 90, 14, 89]. For example, the Cell's SPE instruction set is a SIMD extension that operates only on 16-byte boundaries, and if data is unaligned the compiler must handle that explicitly by using, for example, rotate or shuffle operations to align and extract data [85]. The alignment problem has become one of the challenges in using SIMD instruction sets, as data in many applications is unaligned. For instance, unaligned dynamic memory accesses in the MediaBench and SPEC95fp benchmarks represent around 86% of the total memory accesses [90].

The alignment problem is, in fact, also one of the most important issues in the VSM design, which is meant to imitate the SIMD technology using bigger (virtual) registers. Because VSM is designed to parallelise array operations on multiple cores, it is very likely that sub-array boundaries will be unaligned. To illustrate this point, let us consider using the Cell's SPEs, which require data to be aligned to 16-byte or 128-byte boundaries. Assume that the arrays A, B and C are 32-bit floating point arrays, all aligned to a 128-byte boundary, and that the size of A is 1024, B is 2048 and C is 4096 elements. Also, assume that data is divided equally when multiple SPEs are used. Now, let us start with the first expression shown in Figure 2.1. The VP compiler will deal with this expression as if it operates on arrays of 1024 elements. The decision is based on the size of the left hand side of the expression, and thus such expressions are valid as long as the R.H.S. arrays are equal to or bigger than the one on the L.H.S. of the assignment operator. Provided that, the compiler will add the first 1024 elements of B to the corresponding elements of C and store the results in array A. Evaluating this expression on 1, 2, 4 or 8 SPEs will comply with the SPE's alignment constraints and will give correct results. It is also expected to improve the SPEs' performance because the operation works on the same corresponding elements of the arrays A, B and C, and the starting address of each sub-part of the three arrays on each SPE adheres to both cache line and hardware alignment rules, even when the arrays


are partitioned across 2, 4 and 8 SPEs. However, if a parallelising tool were strictly designed to divide data equally, the tool could not parallelise such arrays on 3 or 6 SPEs.

A := B + C;
A[1..1024] := B[15..1038] + C[2111..3134]

Figure 2.1: Arrays Alignment

The second expression shown in Figure 2.1 is also a valid array expression in VP. It uses the same arrays A, B and C, but it operates on sub-ranges of the arrays. Though the starting addresses of arrays B and C are aligned, the first SPE will start from element 15 in B and element 2111 in C, and these locations are not aligned. This raises two separate but related matters: validity and performance. Evaluating the second expression without handling the misaligned starting addresses of the sub-parts of B and C will generate an alignment error when attempting to load these sub-parts. This would also be a problem when attempting to store the results if array A started from a memory location that is not aligned, instead of starting from the first element as it does in Figure 2.1. One solution to this alignment problem is to load additional bytes into temporary data buffers and then shift, rotate or copy the data into aligned buffers. The problem of storing unaligned data is even more complicated and challenging than loading additional bytes because it requires process synchronisation. However, fixing alignment problems must not generate alignment errors and, most importantly, must produce the expected results. This process often comes at the expense of slower performance, yet one can sacrifice performance to gain concurrency but not correct results. The alignment and synchronisation algorithms that the VSM model employs to solve these issues will be discussed in detail in Chapter 5.
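The "load extra bytes and then shift or copy" idea can be sketched in plain C as follows; the alignment unit, names and use of memcpy are illustrative only, and the actual VSM algorithms, which also cover the harder store case, are described in Chapter 5. The loader rounds the starting address down to an aligned boundary, fetches the enclosing aligned region, and then copies the wanted bytes into an aligned working buffer.

#include <stdint.h>
#include <string.h>

#define ALIGN 16   /* alignment unit in bytes, e.g. the SPE's 16-byte rule */

/* Copy n bytes starting at a possibly unaligned address src into an aligned
   buffer dst by over-fetching from the enclosing aligned region.  On the
   Cell the aligned region would first be brought into local store by DMA;
   here an ordinary pointer and memcpy stand in for that transfer and shift. */
void load_unaligned(uint8_t *dst, const uint8_t *src, size_t n)
{
    uintptr_t addr   = (uintptr_t)src;
    uintptr_t start  = addr & ~(uintptr_t)(ALIGN - 1);   /* round down to boundary */
    size_t    offset = addr - start;                      /* leading bytes to skip  */

    const uint8_t *aligned_region = (const uint8_t *)start;

    /* Shift the wanted bytes out of the over-fetched region into dst. */
    memcpy(dst, aligned_region + offset, n);
}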

2.3.2 Levels of Parallelism

Parallelism can be exploited at four different levels: bit-level parallelism, instruction-level parallelism, data-level parallelism and task-level parallelism.

2.3.2.1 Bit-Level Parallelism

Bit-level parallelism is based on increasing a machine's word size. The word size on early microprocessors was only 4 bits, and since then it has kept doubling: from 8-bit to 16-bit, then to 32-bit, and lately to 64-bit. The 64-bit architectures now lead the market.


2.3.2.2 Instruction-Level Parallelism

Instruction-level parallelism (ILP) techniques increase the number of instructions that can be executed at a time. Pipelining instruction execution is one of these techniques. Pipelining, in principle, divides instruction processing into several stages; a processor with an N-stage pipeline can have up to N instructions at different stages of completion. This technique became available in the 1960s after the introduction of machines with multi-stage instruction execution and has become the de facto standard technique in modern architectures.

Issuing more than one instruction at a time is another technique for exploiting parallelism at the instruction level. Processors with multiple execution units can issue different instructions to different units. For example, on the Cell processor each accelerator has two execution units, known as the Even and Odd units. The Even unit handles the computational instructions, while the Odd unit handles memory access and branch instructions [56]. These two execution units allow two instructions to be executed at a time, for instance by routing a Load instruction to the Odd unit while at the same time issuing an Add instruction to the Even unit [91]. There are also other techniques to exploit instruction-level parallelism, including out-of-order execution and register renaming [92].

2.3.2.3 Task Parallelism

Task parallelism (also known as function parallelism) focuses on splitting a big problem into sub-tasks. The sub-tasks can then be implemented as threads and executed across parallel processors. The two important issues in task parallelism are load balancing and synchronization. Sub-task execution times often vary dynamically from one sub-task to another, and hence poor load balancing may degrade performance. Synchronization is also a concern in task parallelism, because improper synchronization may result in race conditions or additional overhead. These issues also hold for data parallelism.

2.3.2.4 Data Parallelism

This form of parallelism concentrates on splitting data into small blocks that can be processed in parallel. The history of data-parallel programming began with the introduction of vector processors in the 1970s, and it has grown widely since the 1990s, after a wide range of modern processors gained instruction-set extensions for performance improvement, especially in multimedia applications [7]. Compared to task parallelism, data parallelism appears promising since:


• Many applications are data-intensive in nature.

• It is easier to spot data-parallel operations than to find subsystem tasks, especially in imperative and array programming languages.

• It is simpler to assign blocks of data to individual processors than to split code into subprograms, assuming homogeneity in operations.

In contrast, there are challenges associated with data parallel execution. The fundamental challenges are:

• Identifying data parallel regions

• Automating data partitioning

• Managing accesses to shared data, which can become a source of problems

• Synchronizing accesses to common data to avoid race conditions

• Synchronizing operations to avoid data inconsistency

2.3.3 Identifying Parallelism

The simplicity of a given model is often measured by the capability of its compiler to handle the parallelisation tasks and, most importantly, by how it exploits available parallelism. Identification of available parallelism is a significant step in the parallelisation process, and there are three possible approaches to identifying parallel regions of code: explicit, semi-implicit and implicit techniques.

2.3.3.1 Explicit Parallelism

Explicit-based models, such as POSIX threads (Pthreads) and the Message Passing Interface (MPI) [93, 31], require substantial details, such as identifying parallelism, message passing, data movement, alignment and synchronization, to be provided manually by programmers. This approach, usually referred to as fully explicit parallelism, is time consuming, error prone and costly. The explicit approach gives programmers control over tuning code for better performance, but it also makes their task harder, because the programmer has to select the parallel parts and manage them, and this can result in performance degradation if the selected parts were not worth parallelising. For these reasons, software researchers have, over the last two decades, put additional effort into developing techniques for managing parallelism automatically. Yet success in developing simple and implicit identification techniques has been limited due to the complex analysis that compilers must perform to identify available parallelism in sequential applications [26, 23, 17, 27].
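To give a flavour of how much an explicit model leaves to the programmer, the following C sketch (Pthreads; the array size and thread count are arbitrary assumptions) parallelises a simple element-wise addition. The programmer must create the threads, partition the data and join the threads entirely by hand.

#include <pthread.h>
#include <stdio.h>

#define N        1024
#define NTHREADS 4

static float a[N], b[N], c[N];

struct range { int lo, hi; };            /* half-open chunk [lo, hi) */

static void *worker(void *arg)
{
    struct range *r = arg;
    for (int i = r->lo; i < r->hi; i++)  /* each thread handles its own chunk */
        a[i] = b[i] + c[i];
    return NULL;
}

int main(void)
{
    pthread_t    tid[NTHREADS];
    struct range rng[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) { /* explicit data partitioning */
        rng[t].lo = t * chunk;
        rng[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &rng[t]);
    }
    for (int t = 0; t < NTHREADS; t++)   /* explicit synchronisation   */
        pthread_join(tid[t], NULL);

    printf("a[0] = %f\n", a[0]);
    return 0;
}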

2.3.3.2 Semi-Implicit Parallelism

Most of the widely used parallel programming models nowadays are capable of handling parallelisation tasks such as communication, data partitioning and synchronisation automatically, but they still require the intervention of programmers to identify the parallel regions in the source code. This is why they are called semi-implicit models. These models depend on annotations, preprocessor directives and constructs to identify the parts of the source code to run in parallel, leaving the rest of the parallelisation process, such as data movement and synchronization, to the compilers. In recent years developers have had good success in applying this technique in various models for programming parallel architectures. Some of these models, such as OpenMP and OffloadC++, depend on preprocessor directives and annotations to exploit available data parallelism [30, 60].

OpenMP is the most commonly used semi-implicit programming model nowadays for shared-memory multi-core architectures [30, 8, 36]. Other models focus on task parallelism, such as Parallel Virtual Machines (PVM), CellVM, Hera-JVM and Intel Threading Building Blocks (TBB) [49, 48, 20, 5]. These models are also explicit-based tools that offer similar identification techniques, initially designed to exploit parallelism at thread level. For example, CellVM and Hera-JVM depend on annotations to define sub-tasks in applications, while Intel TBB depends on a C++ template library to ease the use of standard threading paradigms such as POSIX threads [94]. CUDA and OpenCL are programming models that have recently emerged in the parallel environment for programming specialised platforms. They have been developed to support general purpose computation on processors such as GPUs and digital signal processors (DSPs) [50, 51, 52, 53]. Hybrid Multicore Parallel Programming, called OpenHMPP, is also a directive-based model that has recently been introduced for developing parallel programs on CPUs and GPUs [37]. There are also some models that rely on libraries to exploit data parallelism explicitly. For example, CUDA Data Parallel Primitives (CUDPP), which runs on platforms that support CUDA, is a library of data-parallel algorithms that can be used for parallelising primitive operations such as Reduction and Sort [95].
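By way of contrast with the explicit style shown earlier, the following C sketch illustrates the semi-implicit approach in its simplest OpenMP form (the array size is an arbitrary assumption): the single directive identifies the parallel region, and thread creation, data partitioning and synchronisation are left to the compiler and runtime.

#include <stdio.h>

#define N 1024

int main(void)
{
    static float a[N], b[N], c[N];

    /* The directive below is the programmer's only contribution:
       an OpenMP-capable compiler creates the threads, splits the
       iteration space and joins the threads at the end of the loop.
       A plain C compiler simply ignores the pragma. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[0] = %f\n", a[0]);
    return 0;
}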

Though semi-implicit parallel programming models liberate programmers from several parallelisation tasks such as thread creation, communication, data transfers, synchronization and alignment, they still depend on programmers to identify the parallel regions of code [25, 9, 10]. Explicit and semi-implicit models are also unattractive for parallelising existing sequential applications, because of the alterations and adjustments they require, and they do not guarantee improved performance. In view of these considerations, it would be practically productive to have a tool that can automatically spot and parallelise suitable structures in sequential applications and that frees ordinary programmers from managing parallelism explicitly [25, 9, 8, 79, 10].

2.3.3.3 Implicit Parallelism

The third approach for exploiting parallelism is known as the implicit or automatic technique. Implicit parallel programming models must be able to manage parallelism automatically, starting from thread creation and identifying the available parallelism, going through data partitioning, communication and alignment, and ending with synchronisation. That is, this approach is expected to facilitate the development of parallel programs and to let programmers focus on solving problems in a parallel fashion without needing to think about the parallelisation process.

The implicit approach is a practical technique for extracting parallelism already inherent in the structure of a programming language, such as array and functional languages. High Performance Fortran (HPF) and Matlab [19, 39], for example, offer constructs for expressing parallelism in an implicit manner. ZPL, Fortran 90 and VP also depend on array abstractions to express data parallelism [41, 96, 97, 40, 42, 34]. Functional languages such as SaC and GpH are based on a computation model that also suits concurrency mechanisms [98, 43, 44]. Mutation-free programming models, like functional languages, are suitable for parallel programming at thread level because functions can easily be evaluated in parallel [99, 100, 101]. Chapel, Fortress and NESL are also parallel programming languages, but were developed for particular supercomputers. Such programming languages, which have a basis for supporting parallelism, can depend on compiler technologies to identify opportunities for parallel code implicitly, without the need for annotations and directives [6].

Though implicit-based parallel programming models are very convenient and practical for parallelising existing applications, implicit techniques tend to be complicated and hard to implement, because they require complex code analysis to determine which parts of the code to run in parallel [102, 103, 25]. Also, many implicit parallel languages are not commonly used, because programmers often prefer languages they are more familiar with. These considerations are what motivated us to develop a fully implicit programming model, the VSM model, which focuses only on parallelising array operations in order to narrow the target code and to ease the task of existing compilers, especially for well-known programming languages.


> , ⊥ , ω , ρ , α

Figure 2.2: Sample of APL's Character Set

2.4 Parallel Programming Paradigms

Programming paradigms that allow the expression of parallelism date back to the introduction of the first imperative array language, A Programming Language (APL). Since then, various programming paradigms have been introduced for developing parallel programs, such as functional and array programming languages, API-based models and virtual machines. These paradigms have, however, used different approaches to exploit and handle parallelism.

The discussion below focuses most directly on array programming languages, since the proposed VSM model was initially designed to support array programming compilers in parallelising array expressions. The following sections introduce the array programming languages APL, ZPL and Fortran 90, then touch briefly on Single Assignment C (SaC) and on virtual machine technology, as the proposed VSM model is a virtual-machine-based design. Recently developed parallel programming models such as OpenCL, OffloadC++ and OpenMP are discussed in the next chapter.

2.4.1 A Programming Language (APL)

A Programming Language (APL) is one of the pioneering imperative languages; it introduced the concept of arrays as data objects and has been the most influential of the data-parallel languages. It is based on a mathematical notation invented by K. Iverson in 1957 [104]. In its first releases, APL used an unusual character set with symbols rather than words [104]. Some of these symbols are shown in Figure 2.2.

APL was first used on the IBM System/360 computer in the early 1960s [105]. After that, APL\1130 was introduced for the IBM 1130 and, a few years later, APL was used by the IBM 5100 desktop computer, which consisted of a keyboard and a small CRT [106]. An advanced version of the language, called Sharp APL, was then released in 1979 by a Canadian firm named I.P. Sharp Associates. Sharp APL added a number of language extensions such as shared variables, nested arrays, a file system and packages for grouping objects. In the 1980s, IBM released the APL2 programming language for its mainframe environments such as CMS and TSO. APL2 included new features to improve the use of nested arrays, and it was considered a standard for subsequent APL interpreter developments [106]. In the 1990s, Iverson introduced an advanced version of APL called the J language. J uses an ASCII character set instead of the special symbols used in the previous releases of APL [107].

The APL language provides a workspace, as in MATLAB, in which a user can define functions or operate on data interactively without the need to write complete programs. For example, the following line

V ← 1 2 3 4

assigns the values 1, 2, 3 and 4 to the vector V, and the next line

+/V

performs an add reduction on the vector V and displays 10 as the result.

APL also offers library routines for handling linear algebra, which makes it powerful at performing linear operations such as matrix multiplication. The APL language is available on Windows and UNIX platforms, with interpreters and compilers for both [106, 108, 26]. APLNext, Dyalog and MicroAPL are advanced interpreters that operate on Windows, Unix and Linux. APL is often thought of as an interpreted language, yet there are also a number of working APL compilers. Most APL compilers are source-to-source compilers that translate APL source code to a low-level language, such as C, and then use intermediate-language compilers [26]. In the mid 1980s, Budd developed an APL compiler that worked under UNIX on the VAX-780 vector machine [108]. Budd's compiler is a parallel compiler based on implicit techniques to exploit parallelism in APL programs [108].

The language has been used in many fields and for different purposes, such as mathematical, economic and accounting research and simulation applications, and it has had a great influence on several programming languages, such as J, K, Mathematica, MATLAB, SaC and Glasgow Vector Pascal [106, 108, 58, 109, 44]. It is still in use on IBM mainframe computers, but the popularity of APL has shrunk since the 1980s, partly because of the difficulty of migrating it to the general-purpose computing environment.

2.4.2 Z Programming Language (ZPL)

ZPL is a parallel array programming language that was originally called Orca C. It was developed at the University of Washington between 1993 and 1995 [110]. It is a data-parallel language that supports most primitive operators and data types as well as complex numbers and parallel arrays. The main feature of this language is its flexibility in running programs on both sequential and parallel machines without the need for any form of explicit parallelism such as preprocessor directives [111]. A ZPL program is structured like a Pascal program. It is constructed of procedures that have forms similar to those in Pascal and C. A ZPL procedure can accept values or references as arguments and return a single value. The following prototype shows the general form of a ZPL procedure declaration:

procedure pName(argument):returnType

ZPL has two unique data structures, regions and directions, to declare parallel arrays and to provide indices for array references. What follows looks at these two structures and a number of array operators.

1. Regions

A region is a composite object or entity that accommodates data structures such as arrays. It also reduces the use of loops and indices, as we shall see shortly, to manipulate arrays, and most importantly it helps compilers to generate code that can be executed on single-core machines or parallel machines [40]. A region is much like a conventional array but without an associated data type [110]. The next statements illustrate how to use the keyword region to declare regions of different dimensions and then use the declared regions to define region specifiers and declare arrays. The following example declares a one-dimensional region X with the set of indices {(-2),(-1),(0),(1),(2)}

region X=[-2..2]

while the next statement defines Y as a 2D region of size m x n.

region Y=[1..m,1..n]

A region's name enclosed in square brackets ([..]) is used to define a region specifier. Region specifiers are like types in Pascal: they can be used to declare arrays with the same size and rank, or to augment arrays with borders. For instance, the following statement declares three single-precision floating point array variables A, B and C of type Y.

var A,B,C:[Y] float

2. Directions


direction north = [-1, 0]    One position above in relative orientation
direction south = [ 1, 0]    One position below in relative orientation
direction east  = [ 0, 1]    One position to the right in relative orientation
direction west  = [ 0,-1]    One position to the left in relative orientation

Figure 2.3: ZPL Directions Operators

Directions can be used to declare constant vectors that refer to relative positions in a given array; the mechanism is similar to an index translation. The constant vectors in Figure 2.3 correspond to the four cardinal directions. The general syntax for a direction is as follows:

direction directionName = [d1, d2, d3, ..., dn];

where each d is a constant integer called an offset and n is the rank of the direction. The value of d can be positive or negative. A positive offset refers to an element with a higher index in that dimension, and a negative offset refers to elements with lower index values in that dimension; if a d value is zero, the entire interval of that dimension is copied into the new region or array [40].

A direction and a base region can be combined by using the (of) operator to define (or augment) a new region adjacent to the base region. The general form is [ D of R ]

where D is a direction and R is a base region. For example, the successive effects of the following statements are as follows:

Line1: region X = [1..2,1..3]
Line2: [X] A := 1.0
Line3: direction E = [0, 2]
Line4: [E of X] A := 2.0

Line 1 declares a 2D (2 x 3) region called X. The region specifier X is then used in Line 2 to declare array A, which is initialized to 1.0. After that, a direction called E is declared in Line 3. In Line 4, the E direction is used to augment array A with two columns on the right-hand (east) side, and these augmented columns are initialized with 2.0.

With regard to the boundary problem, in ZPL region specifiers and the (of) operator can be used to augment arrays with borders. For example, Line 4 references the 4th and 5th columns of A, yet array A, as declared over region X in Line 2, does not have these columns. In this case the ZPL compiler, following the definition of direction E in Line 3, extends the array with another two columns on the right-hand side.

3. Array Operators

• Reference operator (@)

The @ operator is used for offset-referencing of arrays. It is used with an array name and a direction to translate the indices of the given array by the given direction vector [40]. For example, given the four directions in Figure 2.3 and a region specifier R, the following statement

[R] A := A @ north

increments every row index i in A by 1; it is equivalent to

A_{i,j} ⇐ A_{i+1,j}

While the following statement

A := B @ west + C @ east

replaces each element of A with the sum of the element just to its left in B and the element just to its right in C. It is equivalent to

A_{i,j} ⇐ B_{i,j−1} + C_{i,j+1}

• Reduction operator (op<<)

It compresses an array to form a smaller one. There are two types of reduction: partial and complete. A partial reduction shrinks a given array to a smaller array, and thus it requires the size of the original region (the operand) and of the new, smaller region to be specified [40]. For example, the following statement


[1..m, 1] Y := +<< [1..m, 1..n] X

indicates that array Y will have only one column and that the n columns of each row in X are reduced (by addition) and the scalar result is stored in the corresponding row.

A Complete reduction produces a scalar reference [40]. For example,

sum := +<< X

results in returning the sum of all the elements in array X.

• Flood operator (>>)

It is similar to the concept of scalar promotion [40]. It allows elements to be copied from a lower-rank array into a higher-dimensional array. Its format is similar to that of the partial reduction operation. To illustrate how this operator works, assume that X and Y are two arrays of the same size and rank; the following statement

[R] Y := >> [2, 1..3] X

then floods array Y with rows identical to row number 2 of array X; that is, it copies the 2nd row of X into every row of Y.

• The wrap operator (@^) wraps a reference around to the opposite side of the array when an offset reference would otherwise fall outside the array boundaries.

ZPL is targeted at supercomputers rather than workstations. The compiler transforms ZPL source code into ANSI C, which can then be compiled for the target machine using a C compiler. This construction gives ZPL some flexibility in linking with other languages.

2.4.3 Single Assignment C

Single Assignment C (SaC), which first appeared in 1994, is a purely functional array processing language designed to support numerical applications, in particular array processing [34]. Functional programming differs from imperative programming in that computation is based on immutable objects instead of stateful objects, and hence purely functional languages, such as SaC, do not have side-effects in the computation process. SaC's design also focuses on supporting arrays as basic data structures, as in APL and J. These two features make SaC well suited to parallel programming [101].


print ( reshape ( [] , [1]) )
print ( reshape ( [3] , [3,5,8]) )
print ( reshape ( [2,3] , [6,1,7,9,5,8]) )
print ( reshape ( [2,3,1] , [6,1,7,9,5,8]) )

Figure 2.4: Defining Arrays in SaC

SaC is based on a subset of C and is also influenced by APL and SISAL. A SaC program's structure is very similar to that of a C program, which makes it easy to adopt for programmers who are imperatively oriented. The language offers a concise way to express array operations at a high level of abstraction [34]. While most array programming languages provide built-in array operations, SaC views arrays as abstract data objects that have specific algebraic properties. This approach offers the flexibility of user-defined operations that can be applied to arrays of any size and rank [112]. The following discussion gives an overview of the language's array essentials and the central WITH-loop construct that SaC uses to map operations over arrays.

• Array Definition

Arrays in SaC are defined using the reshape operator, which takes two arguments: a shape vector and a data vector. The shape vector determines the array's rank (number of dimensions) and size, and the data vector supplies the initial values. Arrays, vectors and scalars are treated the same; scalar values are considered 0-dimensional arrays [112]. Figure 2.4 shows a simple SaC program that defines various arrays.

The first line defines a scalar, initialises it with 1 and outputs 1. The second line defines a vector of size three, initialises it with the given values and outputs the vector 3 5 8. The third line defines a 2-dimensional array of size 2 x 3 and initialises it with the values in the associated data vector, so the output is:

6 1 7
9 5 8

The last line defines a 3-dimensional array of size 2 x 3 x 1 and initialises it with the same values, and thus the output is the same as the previous one.

• Basic Operations

– dim( X ) returns a scalar that represents the rank of the array X.

– shape(X) returns the shape vector of the array X.

– genarray(shp,expr) generates an array of size shp and whose elements have the same value as expr which could be a vector.


with ( exp1 <= Identifier < exp2 ) : value
     Operation

Figure 2.5: SaC WITH loop

– modarray(arr,expr) defines an array of shape shape(arr) whose elements are set to the value expr which could be a vector.

• Reduction operations

– sum(a ) sums up all elements of the array a.

– all(a) returns true if all elements of a are true; otherwise it returns false.

– any(a) returns true if any element of a is true; otherwise it returns false.

– maxval(a) returns the maximum value in a.

– minval(a) returns the minimum value in a.

• WITH-loops

This SaC loop structure can be used to apply a given operation to a subset of the elements of an array. A WITH-loop construct basically comprises two parts: a generator part and an operation part. Figure 2.5 shows the syntax of the with-loop construct [34]. The first part (the generator), which comes immediately after the keyword with, defines a set of index vectors and an index variable that represents the elements of this set. The index vectors are defined by two expressions that determine the lower and upper bounds of a rectangular range. One can also assign a value to the selected elements in the generator part by appending the value after a colon, as shown in Figure 2.5. The second part (the operation) determines the computation to be performed on each element of the index vector set defined in the first part.

The following example shows how a subset of array X, which is declared in the first line, is modified using the modarray() operation.

X = [9,3,5];
with ([0] < K < [6]) modarray ( X , 8 )


The above with-loop example checks each index K into X, and where K > 0 and K < 6 holds it executes the modarray() operation, which sets the value at index position K to 8. The result is therefore [9 8 8].

The SaC compiler accepts source code as input and generates a C file, which can then be compiled using one of the existing C compilers. The SaC compiler and its runtime system can create multi-threaded code that runs in parallel on shared-memory architectures [34] and implicitly handle resource management, such as thread and memory management, for parallel execution.

A few years ago, the University of Glasgow and EPSRC collaborated on merging SaC and Vector Pascal technologies. The team in the School of Computing Science at Glasgow worked on translating the inner loops of SaC to Vector Pascal (VP) instead of C [113], allowing Vector Pascal to parallelise those inner loops and then compile the parallelised code into libraries that can be linked back into the main SaC program. Combining the techniques of the two languages, VP and SaC, offered some gains in performance.

Recently, an auto-parallelising compiler framework has been introduced to generate CUDA code from SaC [114]. This approach targets SaC's data-parallel loops, the WITH-loops, for parallel execution. It maps the WITH-loops to CUDA kernels and performs some transformations to improve performance.

2.4.4 Fortran

Fortran was one of the first high-level programming languages. Its design concentrates on general scientific applications [115]. It was the pioneer language that allowed programmers to use a high-level language instead of a particular assembly language. The first version of Fortran, released around 1957, included 32 statements such as DIMENSION, GOTO, DO loops and a number of I/O statements [116]. One year later, IBM released Fortran II, which included six new statements introduced to support procedural programming with pass-by-reference. IBM also developed Fortran III in 1959, but it was not released because of some machine-dependent features. This problem, however, did not prevent the language from spreading and being widely used in the 1960s. The portability issue grew even more acute as new versions of the language developed by other vendors had the same problem [117]. The portability problem, which was partially solved by Fortran IV, led to the advent of the first standard for a programming language.

In 1966, the American National Standards Institute (ANSI) approved the first Fortran standard, which materialised in two versions: Basic Fortran and Fortran. The Basic Fortran standard was based on Fortran II but did not address the portability issue [116]. The other version, which is known as Fortran 66, was based on Fortran IV [117].

The second-generation standard, known as Fortran 77, was released in 1978. It added features such as IF, END IF, ELSE, ELSE IF, direct-access file I/O and a character data type [118, 119, 116]. This standard also changed or completely removed some rarely used features of the first standard [116]. During the 1980s, newly implemented features, such as the INCLUDE, DO WHILE and END DO statements, were added to the language but were standardised only in Fortran 90. Recursive procedures were also supported by some Fortran 77 compilers, such as DEC's, but were fully supported only in Fortran 90 [116].

The Fortran 90 standard, which was first approved by ISO and then as an ANSI standard, was a major revision of the preceding version. The 90 standard retained most Fortran 77 features and included new features such as the following [116, 120, 96]:

1. Dynamic data structures using the ALLOCATABLE keyword and the ALLOCATE and DEALLOCATE statements.

2. Pointers that can be used to declare and reference dynamic objects

3. SELECT ..CASE for multi-path switching. It is similar to the SWITCH statement in C.

4. Grouping procedures and data together using modules, which are conceptually similar to units in Pascal and classes in C++.

6. A free source form, in which statements no longer need to start in specific columns as in punched-card programming.

7. The length of an identifier name was extended to 31 characters instead of only 6 characters as in Fortran 66 [96, 120, 116].

8. Built-in support for array manipulation.

Basic Concepts of Arrays in Fortran

• The default arrays layout in Fortran is a column-major layout.

• Array variables can be declared as primitive data types, such as integers and reals, or as a user-defined data type.

• An array lower bound is assumed to begin with 1 unless stated otherwise.

• Array variables can be passed as arguments to basic intrinsic procedures, such as SQRT and LEN. Passing arrays as arguments was the only array feature that Fortran 77 supported.


Array Processing   Functionality for arrays was a significant feature of Fortran 90, which extended the basic scalar operations to operate on arrays, for example assignment to entire arrays or to array sections [120]. It also supported dynamic arrays, pointers and operator overloading on user-defined data types, and it introduced several array intrinsic functions, such as vector and matrix multiplication, reduction and reshape, inquiry functions such as SIZE, and location functions such as MAXLOC. It also allowed a number of scalar operations to be applied in the same way to arrays, with the restriction that the arrays (excluding scalars) in one expression must have the same extent. That is, array operands in one expression must conform to one another in size and shape.

Essential Array Concept in Fortran 90

• An array's rank is its number of dimensions. For example, a variable of rank zero is a scalar, and vectors are arrays of rank one (one dimension).

• The term bounds refers to the upper and lower boundaries in each dimension.

• The extent determines the number of elements in a dimension.

• The term shape refers to an array that contains the number of elements in every dimension of the referenced array.

Data Parallelism in Fortran   Manipulating arrays using array operations and array functions is an inherently parallel process, and because Fortran 90 supports these features it was the first Fortran standard to provide support for data parallelism [120, 96]. This was later boosted by the additional support of OpenMP. These features made it the basis for the next generations of Fortran standards, such as Fortran 95 and High Performance Fortran.

Fortran 95 presented slight improvements over 90, such as deallocating dynamic arrays automatically. It was also augmented with other features that are particularly useful for parallelisation, such as pure procedures and the FORALL statement and construct. The FORALL statement is another form of DO loop, but it differs from a regular DO loop in that its contents can be evaluated in any order [119, 19]. This provides a form of implicit data parallelism. The FORALL construct, however, can be considered as multiple FORALL statements that need to be executed in order. Functions called from inside a FORALL construct must be pure, i.e. have no side effects.

Co-array Fortran (CAF) is an extension of Fortran 95/2003 for parallel computing [121]. The basic concept in CAF is that a CAF program is replicated a number of times into copies called images.


Each image has its own set of data objects, and all the copies are executed asynchronously [121].

2.5 Virtual Machines

Virtual machine (VM) technologies have been in use for decades to get around some real-machine constraints. Different virtual machine models exist to solve different problems, ranging from splitting up a machine into separate virtual machines, optimizing programs and creating semi-machine-independent programming languages to building parallel machines from heterogeneous platforms [122].

One model is called a system virtual machine. It is used to divide hardware resources into several identical copies of the original machine. The model was proposed during the 1960s, when computers were very costly yet large enough to be shared by multiple users. In 1972, IBM released, for its System/360-67 and System/370 machines, a time-sharing operating system called Control Program/Cambridge Monitor System (CP/CMS) [123]. The CP component was responsible for creating the virtual machine environment, while CMS was a single-user operating system that could be used interactively. The CP/CMS OS offered a stand-alone computer system capable of running the same software that ran on the original machine [123].

Another model is called a process virtual machine. This type provides an interface between a user application and an operating system or another application and allows applications to run in the same manner on any machine [123]. The most commonly used example of this model is the Java Virtual Machine (JVM). The JVM allows programmers to use VM functions rather than machine-dependent functions. This approach offers a portable platform for Java programmers, since any Java program should work on any machine on which the JVM is installed. Another example of a process virtual machine is the Low Level Virtual Machine (LLVM). It is an intermediate compiler layer and a language-independent framework designed for optimising programs at compile time, link time or run time. LLVM accepts intermediate form (IF) code from a compiler and outputs optimised IF code that can then be transformed and linked into machine-dependent assembler code. The first version of LLVM, released by the University of Illinois in 2003, was implemented in C/C++. Currently, it can accept intermediate form code generated by most GCC front ends to optimise C/C++, Objective-C and Fortran source code. The low level virtual machine model became popular after many GNU front-end compilers were updated to support this technology. Clang is one of these new compilers and was built on top of the LLVM optimiser [124]. The Glasgow Haskell Compiler (GHC) also supports LLVM and has shown performance improvements compared to the ordinary GHC back end. Haskell developers are also working on a new compiler that can generate code for LLVM. The new compiler is called the Utrecht Haskell Compiler (UHC) [125], and it supports most original Haskell features. The first UHC was released in 2009 but is not yet complete.

The other common VM model is called the Unix model. This model combines the features of the first two models, the system model and the process model, into one. The Unix OS depends basically on a series of separate processes to handle user commands, and each process has an independent set of machine resources available to it.

Virtual machine designs, however, have shifted during the last two decades toward the parallel computing environment. This technology is called a Parallel Virtual Machine (PVM). It is designed for parallel networking of heterogeneous software platforms, Windows and/or Unix machines [126, 127]. A PVM program, in general, is a set of cooperating tasks that can access virtual machine resources via a set of standard routines. The first version of PVM was developed in 1989 by the University of Tennessee, Oak Ridge National Laboratory (ORNL) and Emory University. Since then several versions have been released. The existing PVM provides a set of functions for exchanging messages and managing tasks and resources. These functions can be used for parallelising source code manually.

Virtual machines can also employ emulation techniques to implement a user-level instruction set on top of a standard instruction set. This technique allows virtual resources to be mapped onto real machine resources while hiding the underlying details of the target hardware and software. This technique was adopted in this project to emulate a novel SIMD instruction set, using virtual SIMD registers and stub routines as a virtual SIMD instruction set.

2.6 Genetic Algorithms

Genetic algorithms (GAs) are based on techniques inspired by aspects of natural science such as inheritance, reproduction and mutation, and they are used as optimization techniques for searching large solution spaces [128]. In computer science, for example, GAs can be used in data sorting and searching, in circuit design and to improve the quality of tools such as code generators [129]. This section looks at basic genetic algorithm principles and the main genetic operators. It then introduces two earlier approaches in which GAs were used for optimising generated code.


2.6.1 Genetic Algorithms Techniques

Generally, a genetic algorithm starts with a set of solutions, the initial population, from which new generations are bred. In each generation, a new population is formed by selecting the fitter recent solutions as parents. The parents are then used to breed new solutions (offspring). This breeding process usually includes mutation operations that introduce new characteristics into the offspring. A newly formed population is expected to be better than the previous one, and the process is repeated until some condition, such as reaching an optimal solution or a maximum number of generations, is fulfilled.

The primary terms used in genetic algorithms are:

• Individual is any possible solution to the problem.

• Population consists of a fixed number of individuals genomes.

• Genomes are fixed-length strings of genes, commonly encoded as binary strings, as shown in the following two examples:

Genome:  0 1 0 1 1 1 1 0 0 1 1
Genome:  1 1 1 0 0 1 0 1 0 0 1

• The fitness function of a genome is some sort of quality measure which determines how desirable it is to have that genome in the population.

• A new generation is a population that was produced by carrying out series of com- putations on a current population.

2.6.2 Main Genetic Operations

• Selection

In this process, individuals are chosen from an existing population (the fitter solutions) to pass into a new generation. The selection operation is usually preceded by the following three steps:

– Get fitness values for all solutions.

– Normalise these values by dividing each individual’s fitness by the sum of all fitness values.

– Sort the current population in descending order.


Then comes the selection operation itself. There are a number of algorithms for selecting the fitter solutions. Some algorithms base their selection on a given arbitrary constant: only solutions with fitness values higher than the given value are selected. Other algorithms take only a certain percentage of the best individuals. The optimiser that will be discussed in Chapter 7 is based on the latter approach; it takes the best 2/3 of the solutions in a population and retains them unchanged in the next generation.

• Reproduction

This step comes after the selection process. A new organism is reproduced from a pair of "parent" solutions, giving rise to offspring. This operation basically results in combinations of genes that differ from those of either parent.

• Mutation

This operation alters a few gene values in a chromosome or genome from their original state [128, 129, 130]. There are several types of mutation operators, such as bit-string, flip-bit, boundary, uniform and non-uniform mutation. The flip-bit type inverts values in a chromosome; for binary genes the mutation operator turns 0 into 1 and 1 into 0. However, the number of values to be changed, or the mutation probability, should be set low to ensure that the search does not become a purely random search [128].
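As a small illustration of the flip-bit operator, the following C sketch (the genome length and mutation rate are arbitrary assumptions, not the settings used by the optimiser in Chapter 7) inverts each bit of a binary genome with a small probability:

#include <stdlib.h>
#include <stdio.h>

#define GENOME_LEN    16      /* bits per genome (assumption)     */
#define MUTATION_RATE 0.02    /* probability of flipping each bit */

/* Flip-bit mutation: every gene is inverted with a small probability,
   keeping the search from degenerating into a purely random walk. */
static void mutate(unsigned char genome[GENOME_LEN])
{
    for (int i = 0; i < GENOME_LEN; i++)
        if ((double)rand() / RAND_MAX < MUTATION_RATE)
            genome[i] ^= 1;   /* 0 -> 1, 1 -> 0 */
}

int main(void)
{
    unsigned char g[GENOME_LEN] = {0,1,0,1,1,1,1,0,0,1,1,0,1,0,1,0};

    srand(42);                /* fixed seed so the sketch is repeatable */
    mutate(g);

    for (int i = 0; i < GENOME_LEN; i++)
        printf("%d", g[i]);
    printf("\n");
    return 0;
}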

3 The Cell Processor & Vector Pascal

This chapter describes the processor and programming language targeted in this research. It starts by giving background on the PowerPC architecture and machine-dependent features such as stack frames and function calling conventions. After that, it introduces the Cell processor, its heterogeneous cores and the main components of its accelerators; the last section of the chapter introduces the Glasgow Vector Pascal language.

3.1 Introduction to PowerPC Architectures

PowerPC stands for Performance Optimization With Enhanced RISC Performance Computing. Some of the well-known RISC processors are DEC Alpha, ARC, MIPS, SPARC and PowerPC [131, 92]. This section gives an overview of the two instruction set architecture styles, CISC and RISC. It then presents the main features of the PowerPC architecture and discusses the development of the PowerPC back-end compiler and some machine-dependent features such as stack management and function calling conventions.

3.1.1 Instruction Set Architectures

Before the RISC architecture was introduced, many computers were based on the Complex Instruction Set Computer (CISC) design. The design of CISC chips places the emphasis on hardware and complex instructions [131, 3]. The aim of CISC processors is to use as few lines of code as possible by packing more information into a single instruction, which results in a complex instruction set; the responsibility for handling this complexity is laid mainly on the processor hardware. This approach requires less effort to map a high-level language statement into assembly instructions. For example, Figure 3.1 shows a Pascal statement and then the corresponding single CISC assembly instruction. The Pascal statement adds two values, B and C, and stores the result in A.


A := B + C ;
ADD A, B, C ;

Figure 3.1: Pascal Code and CISC Assembly Instruction

A := B + C ;
LOAD B
LOAD C
ADD B,C
STORE A,B

Figure 3.2: Pascal Code and RISC Assembly Instruction

In contrast, the design of RISC chips, like the processor targeted here, emphasises software and reduced instructions. RISC instructions are short and simple, and therefore most of them can be executed within one clock cycle [131]. For example, Figure 3.2 shows the same Pascal statement and the corresponding RISC instructions.

Notice that the same statement maps to four RISC assembly instructions, while in CISC a single instruction did the job. These two simple examples show that RISC generally requires more instructions than CISC, yet each RISC instruction performs one task rather than the multiple tasks of a CISC instruction.

RISC architectures are also called load/store architectures, since memory may be accessed only through load and store instructions, while other operations are carried out on registers. The advantage of register-to-register operations is that reused operands are kept in registers instead of being fetched from memory again and again. Other characteristics of RISC instructions are:

• Uniform instruction length

• Few machine instructions

• Most instructions are single-cycle execution.

• Suitable for optimization techniques

The disadvantage of RISC is code size: as the two previous examples show, source code for RISC processors is encoded into more machine instructions than for CISC.


D-form:  bits 0-5 opcode | 6-10 Reg#1 | 11-15 Reg#2 | 16-31 Immediate (signed integer)

X-form:  bits 0-5 opcode | 6-10 Reg#1 | 11-15 Reg#2 | 16-20 Reg#3 | 21-30 XO | 31 /

Figure 3.3: PowerPC Instruction Format

3.1.2 Main Features of PowerPC

• PowerPC instructions are fixed-length 32-bit instructions, even on 64-bit PowerPCs.

• It provides several instruction formats; the two commonly used formats are the D-form and the X-form (see Figure 3.3).

A D-form instruction can address two registers (source and destination) and contains a 16-bit immediate field. The D-form is the only format in which PowerPC architectures provide an offset or an immediate value, and hence loading a 32-bit immediate value into a register requires at least two instructions. For example, the common way to load an address into a register is to use first a load immediate shifted (lis) instruction, which loads a 16-bit value into the destination register shifted into the highest 16 bits, and then an OR immediate (ori) instruction, which bitwise ORs the lowest 16 bits into the same register.

Figure 3.3 also shows the X-form instruction in its basic form. This form of instruction addresses two source registers and one destination register. The XO field is an extended opcode field, and the last bit can be used to alter the condition register.

• Memory is arranged as a linear array of bytes indexed from 0 to 2^n − 1, where n is the execution mode of the CPU. Recent PowerPC processors provide two execution modes, 32-bit and 64-bit.

• Big-endian byte order is used: the most significant byte of a data item is stored first.

• Figure 3.4 shows a set of PowerPC registers. The registers are divided into three classes:

– Volatile registers can be freely used at all times.

39 3 The Cell Processor & Vector Pascal

Register Type              Register No     Status         Used for
General Purpose (GPR)      GPR0            Volatile
                           GPR1            Dedicated      Stack Pointer
                           GPR2            Dedicated      Table of Contents pointer
                           GPR3            Volatile       1st Argument / Return value
                           GPR4-GPR10      Volatile       2nd - 8th Arguments
                           GPR11-GPR13     Volatile       Special tasks
                           GPR13-GPR31     Non-volatile
Floating-Point (FPR)       FPR0            Volatile
                           FPR1-FPR13      Volatile       1st - 13th Arguments
                           FPR14-FPR31     Non-volatile
Condition Register (CR)    CR0:CR1         Volatile
                           CR2:CR4         Non-volatile
                           CR5:CR7         Volatile
Special Purpose Registers  LR              Volatile       Link Register; return address
                           CTR             Volatile       Counter Register

Figure 3.4: PowerPC ABI registers conventions

– Non-volatile registers should not be altered across routines calls. However, if there is a need to use them in a called routine, then they must be saved before being used and restored prior to return.

– Dedicated registers are designated for certain tasks and must not be modified.

• Four classes of instructions

– Fixed-point instructions that can operate on doubleword, word, halfword, and byte operands.

– Floating-point instructions can operate on single-precision and double-precision floating-point operands.

– Load/store instructions support loads and stores of fixed-point operands between memory and general purpose registers, and loads and stores of floating-point operands between memory and floating-point registers.

– Branch instructions

• Computational instructions are not permitted to access storage. For example, to perform an operation on an operand, the contents of the operand must be first loaded into a register, updated, and then stored back in the target location.


3.1.3 Machine Description

A machine description process involves defining and managing a target machine's resources, such as memory, registers and their types, addressing modes, instruction patterns and operations. Machine descriptions are used by code-generator generators to derive code generators automatically.

3.1.4 ILCG Notations

Glasgow VP back-end compilers use the Intermediate Language for Code Generator (ILCG) notation to map machine instructions onto an assembly language. Some of the ILCG notations are:

• Memory: Accesses to memory are represented using predefined array mem. For example, mem(2000) represents the location 2000 in the memory.

• Data formats: octet, halfword, word, doubleword, quadword.

• Integer data types: int8, uint8, int16, uint16, int32, uint32, int64, uint64

• Floating-point Data types: ieee32, ieee64

• References: the keyword ref can be used to refer to an address of a given type. For example, ref int32 refers to the address of a 32-bit integer.

• Casting: data can be converted from one type to another using a C-like cast notation. For example, (int64)ieee64 converts a 64-bit floating-point value to a 64-bit integer.

• Arithmetic operations: ILCG offers the basic arithmetic operations (+, -, *, div and mod). The syntax is prefix form: the operands are enclosed within brackets and prefixed with the operation. For example, 2/(4 + 6) would be represented as div(2, +(4, 6)).

• Assignment: The assignment operator, like in Pascal, must be preceded by a colon (:=).

The following examples show the three basic ILCG clauses, register, pattern and instruction pattern, which are used for describing machine resources. These examples are drawn from the PowerPC description:


• Register declaration

The keyword register is used in ILCG to define a new register and its presentation in the machine assembly.

register int32 R3 assembles['r3'];
register int32 R4 assembles['r4'];
...

Registers can also be grouped under one name as follows:

pattern reg means[R3|R4| ... | Rn];

• Patterns definition

In ILCG, the keyword pattern is used to define non-terminal symbols or instructions. A non-terminal symbol must be defined before it can be used in a machine description. Non-terminals usually offer a number of possibilities, or patterns, that the code generator can match against [42]. For example, the following code defines a non-terminal symbol called real which has two possibilities, double or float.

pattern real means [double | float];

An ILCG instruction pattern defines an instruction that takes a number of arguments, together with its semantics and the corresponding assembly code. For example, arithmetic operations are defined as follows:

instruction pattern Op(operator op, reg r1, reg r2, reg r3, int t)
    means[r1 := (t)op((t)^(r2),(t)^(r3))]
    assembles[op' 'r1','r2','r3];

A machine description should include, in an ILCG file, all the instructions that are to be implemented for the target machine. The ILCG file is then passed through a code-generator generator to create a Java class implementing the code generator for the target machine [113]. The code generator matches these patterns against the source program; if a match succeeds, it produces the assembly code associated with the matched pattern.


The ILCG notation is used to describe the sequential back-end compiler for the PowerPC and its extension. The PowerPC compiler extension includes a description of a set of virtual registers and virtual SIMD instructions (VSIs), which are basically a set of RISC-like register load, operate and store operations. The VSI set supports basic operations, such as Load, Store, Add, Sub, Sqrt, etc., in a mapped fashion. The compiler's code generator treats the VSIs as a SIMD-like instruction set; the VSIs place the proper information in the right registers and invoke the proper routine. The semantics of some sample Virtual SIMD Instructions, and how they are mapped into machine code, are given in Section 6.1.

3.1.5 Stack Conventions

PowerPC compilers must apply and adhere to the following stack conventions:

• The stack must grow from higher addresses to lower addresses.

• The stack pointer (SP) must be kept in the general purpose register number 1 (GPR1).

• The SP must point, at any given time, to the lowest address of the stack.

• Create a stack frame for every function or procedure if needed.

• A stack frame must be alive from the point the procedure is called or executed until the procedure returns to the caller or reaches its end.

3.1.6 Stack Frames Management

Exchanging information between program routines usually requires space on the stack to hold the information needed for this inter-routine communication, and therefore every routine that is active at a given time usually has an associated stack frame. Stack frames are created at the beginning (prolog) of a procedure call to set up the new execution environment and released at the end (epilog) of the called procedure. The information that should be kept in a stack frame is as follows:

• An area to keep hold of passed parameters.

• A place to maintain a stack pointer.

• A location to save the return address.

• An area to store special purpose registers that could be altered.

• Room for local objects.

• Enough space to store non-volatile registers.

Several architectures, such as Intel's, offer instructions that perform these tasks automatically, but on architectures such as the PowerPC the two sessions, prolog and epilog, must be hand coded. The following steps show what needs to be done in each session:

• Prolog ( ENTRY ) Session

This session establishes a new execution environment for the called routine. This establishing process involves four basic steps:

– Save the caller’s frame pointer and the address of the return point.

– Compute the new stack frame size and align it.

– Allocate a new stack frame and update the SP and the FP.

– Maintain a link to nested routines. More explanation shall be given in the implementation of the PowerPC back-end compiler.

• Epilog (LEAVING) Session

The epilog session releases the called procedure's frame once it has executed and re-establishes the old environment before returning to the caller. This process also involves four steps:

– Restore any saved registers.

– Release the current frame by updating the SP.

– Restore the return address.

– Return to the caller.

3.1.7 Functions Calling Conventions

Function calling conventions are part of what is called an Application Binary Interface (ABI), which defines the rules that permit separately compiled routines in the same or different languages to interact with each other. In other words, a compiler for one language must adhere to the calling conventions of the other languages it expects to interact with, in order to set up the right environment and to ensure safe and efficient interaction between modules from different languages. The following discussion introduces the main Pascal and C calling conventions on the PowerPC and other parameters associated with function calling mechanisms:

• Pascal Calling Conventions

– Parameters are pushed on the stack from left to right.

– Function results are returned on the stack.

– The caller locates a space on the stack for a function result before it pushes the parameter on the stack.

– The caller removes the result from the stack.

• C Calling Conventions

C programming language was chosen here because it is widely used and many other languages apply the same C rules. The C calling conventions are:

– Parameters are passed from left to right

– Arguments are passed via registers. If there are not enough registers, the remaining arguments are stored in the stack frame.

– Scalar function results are returned in registers.

– Structured data types are returned by their pointers.

• Passing Arguments Conventions

A PowerPC compiler must conform to the following conventions when passing arguments from one routine to another:

– Parameters that are passed via registers must comply with the specified utilization of the individual registers, as shown in Figure 3.4.

– If the parameters to be passed need more space than the available registers can handle, the stack frame should then be used to hold the additional parameters.

– When a procedure sets up its stack frame, it has to reserve enough space for the largest parameter list of any procedure call it makes.


– Parameters are pushed on the stack from right to left, and hence the first parameter (from the left) will be located at the top of the stack. This approach guarantees that any function can determine the exact offset of each parameter no matter how big the parameter list area is.

• Nested Procedures Conventions

Vector Pascal, unlike C, allows nested procedures, in which a procedure is encapsulated within another procedure. The crucial point with nested procedures is that compilers must be able to access data in enclosing levels of nesting. Display, lambda lifting and static linking are some of the techniques that can be employed to handle nested procedures. Only the first technique is considered here, since the PowerPC back-end compiler employs it to keep track of the frame pointers of nested procedures.

The compiler uses this technique to store the frame pointers of enclosing nested procedures, which can then be used with offsets to access any value in those procedures. This linking process is handled during the prolog session. To illustrate it, let us assume the following:

– The general purpose register (r1) holds the stack pointer (SP) and (r31) holds the running procedure’s frame pointer (FP).

– The FS variable specifies the frame size for a procedure.

– The NL variable determines the lexical level of a procedure.

– The Display area is the space at the start of the current frame.

Now, the following steps explain how the Display technique works:

– If NL = 0 then

* Store the current (caller’s) FP on the stack. The new frame pointer (FP) should be the same as (SP).

* Allocate space for the new frame by incrementing SP. The SP should be incremented by the calculated frame size.

– If NL > 0 then (Nested procedures)

* Store the current FP on the stack (display area). From the called procedure's point of view, this represents the preceding nested procedure's frame pointer.

* Keep a copy of the current SP, which will later become the new frame pointer.


* Now iterate through the display area of the current frame to copy all nested procedures' frame pointers, which are already recorded in the current frame, into the display area of the new frame. The loop should iterate NL times, and in each iteration it does the following:

1. Get the NLth nested frame pointer.

2. Store the pointer into the display area of the new stack frame.

3. Decrement NL

4. If NL > 0, go to step 1

* The new frame pointer FP should take the SP value to point to the start of the new frame.

* Now allocate space for the new frame by incrementing the stack pointer; the SP should be incremented by the calculated frame size (a C model of this prolog logic is sketched after this list).

• Return values conventions

A procedure result or function's return value can be considered an output parameter. According to the PowerPC ABI, fixed-point and floating-point return values are returned in general purpose register No. 3 (GPR#3) and floating point register No. 3 (FPR#3) respectively.
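To make the Display technique described under the nested procedure conventions more concrete, the following C fragment is a rough, machine-independent model of the prolog logic. The structure name, the fixed display size and the use of plain pointers are illustrative assumptions; they are not the actual PowerPC code generated by the compiler.

/* Hypothetical C model of the Display-based prolog; sp and fp stand in
   for the PowerPC registers r1 and r31, and MAX_NESTING is an assumption. */
#define MAX_NESTING 16

typedef struct Frame {
    struct Frame *display[MAX_NESTING]; /* frame pointers of enclosing procedures */
    /* ... saved registers, locals, parameter area ... */
} Frame;

static char   stack[1 << 16];
static char  *sp = stack;               /* models r1 (stack pointer)  */
static Frame *fp = (Frame *)stack;      /* models r31 (frame pointer) */

/* Called at procedure entry: nl is the lexical level NL, fs the frame size FS. */
static void prolog(unsigned nl, unsigned fs)
{
    Frame *newf = (Frame *)sp;          /* the new frame starts at the current SP  */
    newf->display[0] = fp;              /* store the caller's frame pointer        */
    for (unsigned i = 1; i <= nl; i++)  /* copy the NL enclosing frame pointers    */
        newf->display[i] = fp->display[i - 1];
    fp = newf;                          /* FP now points at the start of the frame */
    sp += fs;                           /* allocate the frame by incrementing SP   */
}

With NL = 0 only the caller's frame pointer is stored, matching the first case above; with NL > 0 the loop copies the enclosing frame pointers into the new display area before FP and SP are updated.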

PowerPC architectures also do not provide stack operations such as push and pop, nor do they have hardware support for some functions such as trigonometric functions. During the development of the PowerPC back-end compiler, I had to code all these machine-dependent issues manually, from the machine description and stack operations up to the handling of the Pascal and C calling conventions. The hand-coded implementation of these issues is discussed in depth in the following chapters.

3.2 The Cell Broadband Engine Processor

The Cell architecture was introduced in 2006 by Sony, Toshiba and IBM, but its first mass-market use was in the PlayStation 3 consoles released in 2007. The Cell is a heterogeneous multi-core processor that consists of a general Power Processing Element (PPE) and 8 Synergistic Processing Elements (SPEs) [54, 55, 56]. The Cell processor potentially offers high levels of parallelism, but it is not easy to program due to its heterogeneity of CPUs, memory structures and instruction sets. The PPE processor is a PowerPC

architecture that is responsible for the overall control of the processor, while the SPEs are simple RISC architectures that are mainly designed for computation. The performance of the first-generation SPEs on double precision, however, was poor; it was around 10 times slower than their single-precision performance. For this reason, the manufacturers pushed the introduction of the second generation of the Cell processor in 2008. The second Cell generation, known as the PowerXCell, was very similar to the Cell BE except that its SPEs were re-engineered to improve double-precision performance [56]. The PowerXCell's SPEs were five times faster than the first generation in handling double-precision operations. Cell BE processors were the main architecture of the fastest computer in the world in 2008, and it still held a Top 10 position in the 37th TOP500 list of 2011 [132].

Cell is a distributed-shared memory architecture which has a main memory space on the PPE and private Local Storage (LS) on each SPE. The main memory represents the entire effective-address space that is available to all of the Cell's elements. The SPEs' LSs can be accessed directly by the SPEs, or by the PPE through DMA controllers in a non-coherent mode [56, 57, 54]. The other heterogeneity of the Cell appears at the instruction level, as the PPE and SPEs have different instruction sets. The machine's heterogeneity presents a major challenge for developing Cell applications because it requires writing source code for each core type and requires two compilers.

A number of parallel programming models have been released to parallelise applications on the Cell processor. Some of these models apply task or thread parallelisation schemes, such as CellVM and Hera-JVM [48, 133]. Others focus on data parallelisation schemes, such as OpenMP, SieveC++ and OffloadC++ [79, 38, 29]. Also, the recent GNU toolchain and IBM XL offer C/C++ and FORTRAN compilers for both architectures, the PPE and the SPEs [134, 135]. The Cell architecture showed performance potential, especially when running code well adapted to its heterogeneity [56], and this is what motivated us to select this architecture.

3.2.1 The Power Processor Element (PPE)

The PPE processor is a general purpose 64-bit PowerPC architecture. It is designed to handle overall control of the system such as running OSs and coordinating the SPEs. The PPE processor consists of:

• 3.2 GHz PowerPC processor.

• 512MB Main memory

• 32 KB L1 data cache and 32 KB L1 instruction cache


Figure 3.5: The Cell BE Schematic Diagram

• 512 KB L2 caches

• 32 x 128-bit vector registers

• One fixed-point integer unit and one floating-point unit

• Load/Store unit.

• An instruction control unit and branch unit

• Multimedia Extensions Unit (VMX)

The PPE instruction set is an extended version of the PowerPC architecture instruction set with slight changes. The PPE instruction set includes SIMD extensions and C/C++ intrinsics for those extensions [56].

3.2.2 The Synergistic Processor Elements (SPEs)

The SPEs are mainly designed for manipulating data. They are independent simple SIMD RISC architectures that are very similar in function to vector processors in which a single instruction operates on multiple data elements [56]. An SPE, as shown in Figure 3.6, consists of two independent main units: a synergistic processing unit and a memory flow controller.

1. A Synergistic Processing Unit (SPU)

a) 128 x 128-bit registers.

Figure 3.6: SPE Hardware Diagram

An SPU operates only on 128-bit vector data types and supports only 16-byte load and store operations. Using literal scalars will not be an efficient approach as the SPUs have to perform a 16-byte read and data shuffling to align scalars.

b) Local Storage (LS)

Each SPU has a 256 KB local memory that is directly addressable by the SPU. Because of the small size of the LS, the SPUs are best suited to applications in which most of the work is done on blocks of data, such as numerical processing, image processing or video streaming.

c) Two Execution Units

Every SPU contains two parallel pipeline units, Even and Odd, for executing instructions. The Even unit handles fixed-point, floating-point, logical and byte operations, while the Odd unit handles load, store and branch operations. The two pipelines allow two instructions to be issued per cycle; for example, one memory instruction and one compute instruction.

2. A Memory Flow Controller (MFC)

The MFC is an interface that controls most of the inter-processor communication between an SPU and the other components of the processor. Because every SPE has its own MFC, the SPUs can work on different system memory spaces concurrently. The separation between an SPU and its MFC also allows data to be transferred in parallel while the SPU is executing other instructions [91]. An MFC is built of:

a) Memory Access Controller

These controllers are responsible for data movement within the Cell processor [56, 54]. A memory controller includes a Memory Management Unit (MMU) that handles the mapping of system addresses, and two queues that accommodate DMA transfers issued by the PPE or the SPEs. One queue is for PPE-initiated DMAs and can hold up to 8 DMA requests. The other queue is for SPE-initiated DMAs and can hold up to 16 DMA requests.

b) Memory Mapped Input Output (MMIO) Registers

The MMIO control registers, which are primarily used to map memory addresses, can be used by the PPE to load an SPE program and data. They can also be used by one SPE to access the local storage of other SPEs [56].

c) Atomic Unit (ATO)

The ATO unit handles atomic DMA commands (functions) and atomic operations [56]. Atomic DMA functions are more reliable in handling shared data than regular DMA functions, which can easily be interrupted [91]. With atomic functions such as getllar() and putllc(), a user can set a lock on an aligned unit of storage. On the Cell processor, the unit of storage is 128 bytes aligned to a 128-byte boundary. The VSM model depends on these two atomic commands to synchronize operations such as store and reduction; I will elaborate on this point when discussing the synchronization techniques in the coming chapters. Atomic operations are relatively quick compared to locks, and do not suffer from deadlock and convoying. The disadvantage of the Cell atomic operations is that they perform only a limited set of operations on a limited range of bytes.
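As a rough illustration of how these atomic commands are typically used, the following SPU-side C sketch performs a read-modify-write of a shared counter held in a 128-byte lock line. It assumes the SDK's spu_mfcio.h interface and the MFC_PUTLLC_STATUS definition; the counter layout inside the line is purely illustrative.

#include <spu_mfcio.h>

static volatile char line[128] __attribute__((aligned(128)));  /* LS copy of the lock line */

void atomic_increment(unsigned long long ea)   /* ea: 128-byte aligned main-memory address */
{
    unsigned int status;
    do {
        mfc_getllar(line, ea, 0, 0);            /* load the lock line and set a reservation */
        mfc_read_atomic_status();               /* wait for the getllar to complete         */
        ((volatile int *)line)[0] += 1;         /* modify the local copy                    */
        mfc_putllc(line, ea, 0, 0);             /* conditionally store it back              */
        status = mfc_read_atomic_status();      /* PUTLLC bit set => reservation was lost   */
    } while (status & MFC_PUTLLC_STATUS);       /* retry until the conditional put succeeds */
}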

3.2.3 Cell Memory Space

The Cell processor is a distributed-shared memory architecture that has eight private local spaces and the main memory space. An SPE's local storage is a memory space that can be accessed only by its SPU [136, 54, 91, 56]. The PPU main memory represents the entire effective-address space that is available to all of the Cell processor's elements. The PPU can access an SPU's LS through DMA controllers in a non-coherent mode. The main memory space can also be configured by the PPU to be shared by the SPUs through the MMIO registers of the MFCs [56].

3.2.4 MFC Interfaces

An MFC communicates with its SPU through two interfaces:

1. SPE Channel Interface

Each MFC has 32 channels for communicating with its SPU. The channels offer local communication within an SPE and have low latency, especially for non-blocking commands [56]. A channel is configured as one-way communication: read-only or write-only, and blocking or non-blocking. Each channel has an associated channel count that indicates the number of outstanding operations. The three main SPE channel instructions are:

• Read Channel (rdch)

The rdch instruction can be used only on channels configured as read channels. For example, the following statement

rdch r5, chNo

reads data from channel “chNo” into the general purpose SPE register “r5”.

• Write Channel (wrch)

The wrch instruction is similar to the read instruction, but it can be used only on channels configured as write channels.

• Read Channel Count (rchcnt)

The rchcnt instruction reads the channel count. For example, if a given SPE channel's count is zero, then the SPE stalls until the channel count becomes greater than zero. The stall overhead is high, and therefore one should try to avoid such stalls.

2. MMIO Interface

The MMIO interface can be used by both the PPU and SPUs to communicate with any MFC.


3.2.5 Cell Communication Mechanisms

• Direct Memory Access (DMA) Transfers

– DMA Commands (functions)

A Cell transfer command is basically a function with several parameters. The four most important parameters are: the local store address, the main memory address, the size of the data to be transferred in bytes, and a tag. Both local store and main memory addresses must be aligned on 16-byte boundaries; for better performance, addresses should be aligned on 128-byte boundaries [91]. These alignment constraints are crucial when accessing main memory, because violating them will most likely introduce problems associated with data sharing. A single DMA command can transfer from one byte up to a stream of 16 KByte [91]. One can also issue a list of DMAs in a single command; the list can hold up to 2048 single transfers, and hence it can move streams of up to 2048*16K ≈ 32 MB. The VSM model uses single data transfers aligned on 128-byte boundaries to load/store data from/into the main memory. The tag, with a value ranging from 0 to 31, identifies a DMA and can be used to group individual DMAs by giving them the same tag value.

The Cell processor provides two functions (commands) for transferring data. One function is called GET; it transfers data from the main memory to an LS. The other is called PUT, and it transfers data from an LS to the main memory. Affixes, such as "f", "b" and "l", can be added to these two functions to impose conditions on the transfer process [91, 56, 54]. The affix "f" means fence synchronization, "b" means barrier synchronization and "l" implies that a function operates on a list of DMAs. The Cell processor also provides other functions that can be used to check a DMA's completion status, for synchronization purposes, etc. (A short SPU-side DMA sketch in C is given at the end of this section.)

– DMA Transfer Process

A transfer request can be initiated by either the PPU or the SPUs. Transfers issued by the latter, called SPU-initiated DMAs, are faster than DMAs issued from the PPU due to the following:

1. The PPE does not have an MFC, and so it depends on the target SPE’s MFC to act on its behalf to transfer data [54, 91].


2. The PPE command queue can house only 8 requests while the SPE queue can hold up to 16 requests.

3. The PPE can issue only one DMA command at a time, but the eight SPEs can dispatch eight DMA commands simultaneously.

• Mailboxes

Mailboxes are FIFO queues that can be used to exchange 32-bit messages between the processor units (PPU and SPUs) [56]. This mechanism is useful for exchanging short messages or control data. Each SPE has two types of mailboxes, Inbound and Outbound:

– Inbound Mailbox

An Inbound mailbox is a 4-entry FIFO queue. Inbound mailboxes are used to send messages from the PPE to the SPEs and between the SPEs.

– Outbound Mailbox

Each SPE has two Outbound mailboxes: an Outbound mailbox and an Outbound Interrupt mailbox. The outbound mailboxes are 1-entry queues. They are used to send messages from one SPE to the PPE or to other SPEs.

The SDK library provides a number of functions to access mailboxes easily [135]. The SDK provides two classes of functions: spu_*_mbox functions and spe_*_mbox* functions. The first class can be used by the SPEs to read, write or check the status of a mailbox, while the second class can be used by the PPE to read, write or check the status of a mailbox. The PPE SDK functions for accessing mailboxes are non-blocking by default, and the SPE SDK functions are blocking by default [56]. Thus writing from the PPE side to a full SPE inbound mailbox may overwrite previous messages, while reading from an empty mailbox or writing to a full mailbox by an SPU results in stalling the SPU. To avoid overwriting previous messages or stalling the SPU, a user must explicitly check the status of the target mailbox before attempting to use it.

• Signals

Signaling mechanism can be also used to signal messages between the PPE and SPEs. Each SPE has two 32-bit signal registers: Signal Notification 1 and Signal Notification 2.


The main difference between the DMA mechanism and the other two mechanisms is that with DMAs an SPU hands its requests to its MFC to transfer data between memory locations, whereas with mailboxes and signals an SPE can only write messages to a mailbox, hoping that the intended recipient will recognize and read them.

The VSM implementation depends on DMAs to move data between the main memory and LSs and depends mostly on mailboxes to exchange short messages such as passing the starting addresses of data blocks to the SPEs or sending acknowledgments between the PPE and the SPEs.
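As a concrete illustration of the SPU-initiated DMA transfers described above, the following C sketch fetches a block from main memory into local storage using the SDK's spu_mfcio.h interface. The buffer size, tag value and function name are illustrative assumptions.

#include <spu_mfcio.h>

#define TAG 5                                             /* any DMA tag value in 0..31 */

volatile float buf[1024] __attribute__((aligned(128)));   /* 128-byte aligned LS buffer */

void fetch_block(unsigned long long ea)                   /* ea: 16-byte (ideally 128-byte) aligned */
{
    mfc_get(buf, ea, sizeof(buf), TAG, 0, 0);    /* queue the GET transfer            */
    mfc_write_tag_mask(1 << TAG);                /* select which tag group to wait on */
    mfc_read_tag_status_all();                   /* block until the DMA has completed */
}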

3.2.6 Exchanging Messages Using SDK functions

The PPE can send messages to an SPE inbound mailbox using the spe_in_mbox_write function, and the SPEs can read messages dispatched to their inbound mailboxes using the spu_read_in_mbox function. The spe_in_mbox_write function, however, is slower than using the MMIO registers, which shall be discussed shortly, because it involves a system call [74, 56].

An SPE can also send a message to the PPE or other SPEs by dropping a message in one of its outbound mailboxes using the spu_write_out_mbox or spu_write_out_intr_mbox functions. Once the message is dropped, the receiver can then read the sender's outbound mailbox.
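The following sketch shows both sides of such an exchange, assuming libspe2 on the PPE side and spu_mfcio.h on the SPE side; spe is an already-created SPE context, error handling is omitted, and the helper names are illustrative.

/* --- PPE side --- */
#include <libspe2.h>

void ppe_send(spe_context_ptr_t spe, unsigned int msg)
{
    while (spe_in_mbox_status(spe) == 0)      /* wait for room in the 4-entry inbound mailbox */
        ;
    spe_in_mbox_write(spe, &msg, 1, SPE_MBOX_ANY_NONBLOCKING);  /* write one 32-bit message */
}

/* --- SPE side --- */
#include <spu_mfcio.h>

unsigned int spe_receive(void)
{
    return spu_read_in_mbox();                /* blocks until a message arrives */
}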

3.2.7 Exchanging Messages Using MMIO Interface

• PPE to the SPEs

The MMIO interface can also be used to exchange messages between the PPE and the SPEs, or between the SPEs, by writing directly to the corresponding SPE's MMIO registers. In order for the PPE to communicate with an SPE, it must first map the corresponding problem state area of the SPE into the PPE address space using the spe_ps_area_get function. The function is part of the SPE library "libspe2.h" offered by the manufacturers [56], and its prototype is as follows:

void * spe_ps_area_get (spe_context_ptr_t spe, enum ps_area area)

The first argument (spe) is an identifier of an SPE thread, and the second argument specifies which problem state area (ps_area) is to be mapped into the PPE address space. The function returns a pointer to the requested problem state area. There are a number of problem state values for the second argument

typedef struct spe_spu_control_area {
    unsigned char reserved_0_3[4];
    unsigned int  SPU_Out_Mbox;
    unsigned char reserved_8_B[4];
    unsigned int  SPU_In_Mbox;
    unsigned char reserved_10_13[4];
    unsigned int  SPU_Mbox_Stat;
    unsigned char reserved_18_1B[4];
    unsigned int  SPU_RunCntl;
    unsigned char reserved_20_23[4];
    unsigned int  SPU_Status;
    unsigned char reserved_28_33[12];
    unsigned int  SPU_NPC;
} spe_spu_control_area_t;

Figure 3.7: SPE_CONTROL_AREA Structure

(ps_area), but the most important one, which is used by the VSM's messaging protocol, is called SPE_CONTROL_AREA. By passing SPE_CONTROL_AREA as the ps_area argument to spe_ps_area_get(..,..), the function returns a pointer to the control area structure of the specified SPE (spe). The contents of the control area structure are shown in Figure 3.7.

Once the problem state area is mapped, the PPE can then directly access that SPE control area by using the pointer of the SPE_CONTROL_AREA structure as a base address to poll or update the parameters that are defined as members of the structure. See Figure 3.7.

Using the MMIO registers does not involve the kernel (a system call) and is therefore faster than using the SDK built-in functions. For this reason, the VSM model depends mainly on MMIO registers to forward messages from the PPE to the SPEs (a minimal sketch of this MMIO path is given after this list).

• SPE to SPE

The first step required for the SPEs to communicate with each other is similar to the above: the PPE code must first map the control areas of the SPEs into the PPE address space. The second step is different, because the PPE must notify each SPE of the pointers to the control areas of the other SPEs; once an SPE knows the pointer to the receiver's control area, it can use a regular DMA transfer to access the receiver's mailbox.
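A minimal sketch of the MMIO messaging path, assuming the SPE context was created with SPE_MAP_PS, might look as follows. The mailbox-status bit layout used in the polling loop is an assumption about the SPU_Mbox_Stat register, and error handling is omitted.

#include <libspe2.h>

volatile spe_spu_control_area_t *map_ctl(spe_context_ptr_t spe)
{
    /* Map the SPE's problem-state control area into the PPE address space. */
    return (volatile spe_spu_control_area_t *)spe_ps_area_get(spe, SPE_CONTROL_AREA);
}

void mmio_send(volatile spe_spu_control_area_t *ctl, unsigned int msg)
{
    /* Assumed layout: bits 8..15 of SPU_Mbox_Stat hold the number of free
       entries in the inbound mailbox; poll until there is space, then write. */
    while (((ctl->SPU_Mbox_Stat >> 8) & 0xFF) == 0)
        ;
    ctl->SPU_In_Mbox = msg;   /* the 32-bit message appears in the SPE's inbound mailbox */
}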


3.2.8 Managing Threads on the Cell

• Pthreads

The main function or procedure that runs on the PPE is the main thread, of type pthread. The main function's return implicitly triggers a call to exit(), using the return value of the main function as the exit status. Pthreads can also be used to create threads that run on an SPE. The following function synopsis shows the elements needed to create a new thread. The first argument holds the ID of the created thread. The second argument (the thread attributes) is often set to NULL. The third argument is the start function to be executed, and the last argument is the argument passed to that function. When the start function returns, the effect is as if pthread_exit() had been called with the value the function returned.

int pthread_create(pthread_t *ID,pthread_attr_t *attr, void*(*start_fun)(void*), void *arg);

• SPE Context (thread)

Cell provides several functions to manage SPE threads and load code onto the SPEs. The list here focuses only on the main functions and parameters that are used in the VSM implementation.

– spe_cpu_info_get(uint flgs, 0)

This function can be used to get information on several issues, depending on the value of flgs. For example, with the SPE_COUNT_PHYSICAL_SPES flag, the function returns the number of SPEs available on the whole system. With the SPE_COUNT_USABLE_SPES flag, which the current VSM implementation uses, the function returns the number of SPEs that are available at this point in time.

– spe_context_create(uint flags,NULL);

This function creates a new SPE context. If the function succeeds it returns a pointer to the created context. The VSM uses the SPE_MAP_PS flag in order to get permission for memory-mapped access to the problem state areas of the SPEs. This feature is mainly used to get access to MMIO registers to exchange messages between the PPE and the SPEs.

– spe_program_load(spe_context_ptr_t , spe_program_handle_t *)


This function loads an SPE main program into its local memory. The first argument must be a valid pointer to the SPE context for which the program is loaded. The second argument is a valid address of the SPE program to be loaded. This step must come before launching an SPE context with spe_context_run.

– spe_context_run(spe_context_ptr_t, uint*, NULL, void*, void*, NULL)

The first argument must be a valid pointer to the SPE context to be run. The second argument determines the entry point from which the SPE program starts executing; the SPE_DEFAULT_ENTRY flag can be used to start execution from the first statement of the SPE main program. It is important to notice that spe_context_run is a blocking call, which means that any thread, whether a PPE or an SPE thread, that calls spe_context_run to launch an SPE thread will be blocked until the called SPE thread terminates and returns. This is a very critical issue for the PPE because if the main PPE thread launches an SPE thread via spe_context_run, the main thread then has to wait for the newly launched SPE thread to finish. I solved this problem by using separate pthread threads to call spe_context_run, leaving the main PPE thread free to communicate and exchange messages with the SPEs (a sketch of this pattern follows).
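The following C sketch illustrates that pattern: each SPE program is launched from its own pthread, so the blocking spe_context_run call stalls only the helper thread rather than the main PPE thread. The embedded program handle name spe_prog is an assumption, and error handling is omitted.

#include <pthread.h>
#include <libspe2.h>

extern spe_program_handle_t spe_prog;          /* produced by ppu-embedspu (name assumed) */

static void *run_spe(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);   /* blocks this helper thread only */
    return NULL;
}

int launch_spe(pthread_t *tid, spe_context_ptr_t *ctx)
{
    *ctx = spe_context_create(SPE_MAP_PS, NULL);     /* allow MMIO access to the problem state  */
    spe_program_load(*ctx, &spe_prog);               /* load the SPE binary into its LS         */
    return pthread_create(tid, NULL, run_spe, *ctx); /* run it without blocking the main thread */
}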

3.2.9 Programming Cell Applications

Programming the Cell multicore processor presents a new challenge due to its instruction set heterogeneity, which requires dual source code: a master program and a child program. The master program runs on the PPE, and its overall role is to instantiate a main thread, spawn SPE threads and handle I/O requests. The code generated for the master program should be very similar to that for other PowerPC architectures [56]. In contrast, the child program runs on an SPE, and its overall role is to handle data transfers between the main memory and the LSs and, most importantly, to perform computation.

Currently, the two programming languages that are fully supported by the Cell processor's manufacturers are C/C++ and FORTRAN. The manufacturers also provide a Software Development Kit (SDK) that includes tools for writing Cell applications, debugging tools, optimized library routines and an Eclipse IDE [56]. Some of the tools that are available now are:

• The GNU toolchain for the Cell processor provides C/C++ and FORTRAN compilers, an assembler and a linker for both the PPE and the SPEs. Both languages support OpenMP. The toolchain also includes an Ada compiler, but for the PPE only.

• IBM XL C/C++ and FORTRAN compilers for Linux. The XL compilers generate code for the two architectures, but they require the GCC tools for assembling and linking programs for the PPE and the SPEs. Both compilers support OpenMP.

• An SPE runtime management library (libspe), a SIMD math library and other mathematical libraries.

• A Data communication and synchronization (DaCS) library to handle communica- tions between Cell components.

The following steps show how to build a Cell application written in C++ using the GNU toolchain:

1. Compile the child program “speProg.cpp ”, speExec is the SPE binary

spu-g++ speProg.cpp -o speExec

2. Embed the binary file “speExec” into a PPE compatible object format. The first name (speProgram) is the name that the PPE is expected to use to call the SPE binary.

ppu-embedspu speProgram speExec spe2ppeFormat.o

3. Compile the master program “ppeProg.cpp” and link it with the object file.

ppu-g++ ppeProg.cpp spe2ppeFormat.o -lspe -o cellExec

3.2.10 Executing Cell Applications

The execution of a C/C++ application on the Cell processor is mastered by the PPE. Figure ?? shows the main execution steps involved in launching an SPE program. When a Cell program is run, a main thread is created automatically on the master processor, and execution then goes through the following steps:

1. loads the main thread with the PPE program and then starts the execution of the application from the main() function.

2. The PPE program should then create SPE threads and issue a command to the MFCs of the SPEs to load the SPE program from the system memory into the SPEs' LSs.


Figure 3.8: SPE Hardware Diagram

3. Each MFC then starts loading the code from the main memory into its SPE LS.

4. The PPE also requests the SPEs' MFCs to begin executing the loaded threads.

5. The requested MFCs in turn order their SPUs to start executing the threads.

Though programming the Cell is not an easy task due to its heterogeneity of memory structures and instruction sets, it showed performance potential, especially when using code well adapted to its heterogeneity. This was, in fact, one of the motivations for investigating the possibility of developing a programming tool that can help Cell compiler developers and simplify the development of parallel Cell applications.

3.3 Glasgow Vector Pascal

Vector Pascal (VP) is an extension of the standard Pascal programming language. There are three different implementations of Vector Pascal. The first implementation was introduced by Turner [137] in 1987 at Iowa State University, US. In 1992, Formella [138] developed another version at Saarlandes University, Germany, and in the early 2000s Dr. Cockshott and his supervised students developed an independent version of Vector Pascal at Glasgow University, UK. The attempt here is to port the last version to the Cell processor, and thus the term Vector Pascal shall, henceforth, refer to Glasgow's implementation [139].

Glasgow Vector Pascal (VP) extended standard Pascal by supporting SIMD instruction set extensions and data parallel operations [58, 42, 7]. The Glasgow VP extension is based on Kenneth Iverson's notation, providing a concise way to express data parallelism.


Iverson's approach, which was introduced in 1962 to address data parallelism, was a machine-independent approach that can be safely used to express scalar instructions or array instructions [35]. The approach also provides answers to array issues such as how to deal with operations on arrays of different ranks and sizes.

Glasgow's extension also introduced new data types, such as pixels, and new operators, such as input and output of arrays and min and max to find the smallest or largest elements in an array. In VP, operations can be applied to arrays in the same way that they would be applied to scalars, leaving the compiler to decide whether the operation is an array or a scalar operation based on the declaration of the operation's operands. These features make it a suitable language for expressing data parallelism. The language has also introduced new notations and overloaded some standard Pascal operators in order to exploit data parallelism, and it has recently been extended by the introduction of the PURE keyword for parallelisation purposes [140].

A number of back-end Vector Pascal compilers for common architectures had already been developed at Glasgow University [7, 140]. The code generator for each machine is a Java class that is automatically generated from a formal machine description written in the Intermediate Language for Code Generators (ILCG). An automatic code-generator generator translates the machine specification, written in ILCG, into an optimizing code generator written in Java.

This language was targeted because, first, the aim was to use an array programming language that supports features which implicitly express data parallelism, such as array operations. Secondly, array names in VP can be safely used in arithmetic operations to operate on data. Thirdly, VP was developed at Glasgow, so more support is available, including the source code and the main developer. Yet the most important feature is that the VP front-end compiler is already designed to support SIMD technology with a flexible degree of parallelism, and hence its code generators can easily be switched to produce SIMD-like code that operates on large registers. Moreover, the manufacturer already provides full C/C++ and FORTRAN support for programming the Cell processor, and there are two other attempts, CellVM and Hera-JVM, to support Java on the Cell processor.

The following Pascal code segment illustrates how operations can be applied to arrays in the same way that could have been applied to scalars:

In Figure 3.9, the first line declares v1, v2 and v3 as arrays of 1024 elements of type integer. The third line is an addition operation that involves all the elements of v2 and v3: each element of vector v2 is added to the corresponding element of vector v3 and the sum is stored in the corresponding element of vector v1. Line 3 in Figure 3.9 is thus a simple example of an array expression that a compiler can safely chop into blocks and process in parallel.


Line 1: var v1, v2, v3: array[0..1023] of integer;
Line 2: begin
Line 3:   v1 := v2 + v3;
Line 4: end;

Figure 3.9: Pascal Code Segment

3.3.1 Vector Pascal Features

VP currently supports a number of operations such as reduction, conditional update operations, array reorganisation, permutation, unary operators and dimension types. Some of these new features, in brief, are:

• Pixel Data Type

This new data type is commonly used in image processing. Pixels are implemented in Vector Pascal as 8 bit signed integers that are dealt with as binary fractions.

• The Reduction Operation

The reduction operation takes an array of rank m and returns an array of rank m − 1. The interpretation of reduction for commutative operators is simple. Consider:

s := \ + v;

where s is a scalar and v is a vector. The right-hand side returns the sum over the vector, and the result is assigned to s.

This operator can also be used to calculate the dot product of two vectors. For example,

x := \ + (v1 * v2)

For comparison, the dot product of two vectors would be written in standard Pascal as follows:

for i := 0 to n do

s := s + v1[i] * v2[i] ;


where n is the upper bound of the smaller of the two vectors v1 and v2.

• Assignment Operators

Vector Pascal extended the Pascal array assignment operator to handle operations on different rank expressions. For example, the following Vector Pascal code segment,

arr : array[0..4, 0..4] of integer;
vec : array[0..4] of integer;
vec := 5;
arr := vec * 2;
vec := \+ arr;

results in assigning 5 to every element of vector "vec", and then assigning 10 to all elements of array "arr". The reduction operator "\+" in the last statement sums the elements of each row of array "arr" and assigns the result to the corresponding element of vector "vec". Writing the same code segment in standard Pascal requires more than 12 lines of code [35]. VP also allows mixed-rank expressions by relying on the VP compiler to automatically generate a loop that spans the ranks of the destination and the source operands; the evaluation is then carried out based on the number of dimensions of the array on the left-hand side.

• Slice Operation

It is a useful operation for many applications that need to manipulate specific sections of arrays. VP has overloaded array abstraction to define sub-ranges of arrays. A sub-range of an array can be defined by using the array name followed by a range expression. For example, the following statement

v1[2..4] := v2[1..3] * 2 ;

results in doubling the 2nd to 4th elements of vector v2 and assigning the values to the third, fourth and fifth elements of v1.

• Dyadic Operations

Arithmetic and logical operations such as -, *, /, div, mod, <, >, <=, >=, =, <>, and, or, shr, shl, min and max are overloaded by VP to operate on arrays. VP also supports vector dot products and matrix multiplication using the dot "." operator, as illustrated in the following code segment:

a1, a2 : array[0..4, 0..4] of integer;
v1, v2 : array[0..4] of integer;
v1 := a1 . v2;
a2 := a1 . a1;

which results in transforming vector v2 by the matrix a1 and in multiplying the matrix a1 by itself.

• Loop Unrolling

Arrays are data structures that can benefit from vector processing because one can perform an operation on a group of elements, or a sub-range of an array, at a time instead of on individual array elements. The array slicing technique helps to implicitly define vector operations by unrolling array operations. For example, let v1, v2 and v3 be vectors each containing 12 real elements; then the following loop

for j := 1 to 12 do
  v3[j] := v1[j] + v2[j];

adds each element of v1 to the corresponding element of v2 and assigns the result to the corresponding element of v3. This loop, which adds the vectors element by element, could however be unrolled using sub-ranges chosen according to the data type. The above loop structure could be unrolled as follows:

for j := 1 to 12 step 4
  v3[j..j+3] := v1[j..j+3] + v2[j..j+3];

to suit, for example, a SIMD instruction set that operates on 128-bit registers, given that each real element requires 32 bits.


pure function foo(i: integer): integer;
begin
  foo := i * i;
end;

Figure 3.10: A Simple VP Pure function

3.3.2 VP Compilers

A VP front-end compiler, which is machine independent, had already been developed at the University of Glasgow [113, 141, 142]. A number of back-end compilers had also been developed at the University of Glasgow for common architectures such as Intel's Pentium series, the AMD Opteron and the Sony Emotion Engine used in the PS/2, but not for PowerPC architectures [113, 141]. Early versions of the compiler supported only single-core machines using the MMX and 3DNow parallel instruction sets. VP also supports [140] the new Advanced Vector Extensions (AVX) instruction format that is supported by processors such as Intel Sandy Bridge and AMD Bulldozer [86, 143]. Subsequently, parallelism has been extended to allow array expressions to be evaluated across multiple cores. Recently, the language has been extended to allow the use of PURE functions for parallelisation purposes [140]. Pure functions can be defined by adding the keyword PURE before the function name. A VP pure function must be side-effect free to potentially permit mapping it over array arguments in parallel. Figure 3.10 shows a simple VP pure function.

A VP back-end compiler implementation basically includes the following:

• Machine description

A machine description defines the available registers, their types, instruction patterns and basic operations. The instruction patterns determine the mapping of the machine semantics onto assembly instructions.

• Machine-Dependent Routines

Different architectures have different features that are often handled on an individual basis to ensure reliability across different environments. For example, function calling mechanisms on PowerPC architectures must be handled explicitly by compilers.


3.3.2.1 Building a VP Compiler

• Translate the machine description into *.ilcg format using m4 macro processor.

• Generate a Java file from the targetMachine.ilcg file using ILCG. The ILCG tool parses the *.ilcg file and generates a Java method for each operation defined in the parsed file.

• Compile the generated Java code along with the manually written routines to create the corresponding classes.

• Combine all the classes, including the front-end classes, into a Java archive file.

3.3.2.2 Compilation Process

The compilation process is divided into three main steps:

• Compile Source Code

– Parse a Vector Pascal source file using a hand-written recursive descent parser to generate an internal data structure, that is, a semantic tree of the source code.

– Trace the generated tree to match the nodes with the semantics of the target machine. If the matching process succeeds, it produces an assembly version of the source program.

• Assemble and Link

The generated assembly code is then fed to a GNU Assembler, commonly known as GAS, and then to the linker to create the executable file.

3.3.2.3 Why VP?

The grounds for choosing VP as a base language over the alternatives of Java, C and FORTRAN are as follows:

• VP supports features that implicitly express data parallelism, such as array operations.

• The target language’s front-end compiler is already designed to support SIMD ex- tensions and a flexible degree of parallelism.


• Java was not targeted due to the difficulty of efficiently transmitting data parallel operations via its intermediate code to a just in time code generator [7].

• C was not considered because it treats array names as pointers. It also allows arithmetic operations on pointers, and consequently any operation in C that involves array names without indices is carried out on the pointers, which usually point to the first element of the arrays involved, not on the data [35].

• The FORTRAN language was already being targeted by another member of our group for porting to the Cell processor.

3.4 Assembly Language Directives

The GNU Assembler, frequently abbreviated to GAS or "as", was used in the PowerPC machine descriptions. The GAS assembler with the AT&T syntax was used because it is supported on Linux machines. GAS, as some users claim [144], is not an easy tool to use, and it also lacks documentation.

We present here some of the GAS directives [145] that were used in the implementation of the original PowerPC machine description and the extended version.

• .data subsection : Informs the assembler to put the following statements onto the end of the given data subsection.

• .text subsection : Notifies the assembler to put the following statements onto the end of the given text subsection.

• .extern : Accepted only for compatibility with other assemblers.

• .align expr : Pads a memory location to a given storage boundary. For example, .align 3 skips to the first memory location that is a multiple of 2^3 = 8.

• Labels: A symbol that is followed by a colon “:” can be used to reference a memory location [146]. It can be also used as an instruction operand. Users are not allowed to use one symbol to represent two memory locations.

• Macros: .macro and .endm are the two commands that can be used to define macros which generate assembly code.

4 Related work

The advent of heterogeneous architectures in the mainstream industry has a significant influence on mainstream software, as they require proper, easy-to-use and up-to-date tools for developing parallel and concurrent programs [13, 15, 147]. In recent years, substantial effort has been put into designing and developing parallel tools for programming general purpose applications on heterogeneous architectures such as multi-core Central Processing Units (CPUs) and Graphics Processing Units (GPUs) [46, 5, 19, 62, 34, 44, 25, 38, 33, 79, 30, 8, 52, 148, 149, 60, 50]. CPUs and GPUs are distinguished from each other by the number of cores and their speed, but the parallel software tools that these two heterogeneous architectures require are conceptually similar. Heterogeneous architectures basically require software components that can make use of different processing elements based on their capabilities [15, 13]. Multi-core CPUs consist of tens of cores and can handle only a few tasks, but very quickly. In contrast, modern GPUs usually contain hundreds of cores that can be used to process hundreds or even thousands of tasks concurrently at reasonable speed, and their performance has been increasing rapidly in recent years [150].

This chapter presents a number of programming paradigms that have recently been developed to simplify parallel computation on heterogeneous shared-memory architectures. It first introduces two parallel programming models, CUDA and OpenCL, that were developed for programming special purpose machines and have lately been adapted to make use of GPUs for general purpose computing [95, 151, 148]. After that, it introduces two other parallel programming models, OpenMP and Intel TBB. These models have already been in use on other architectures but have recently been modified for programming the Cell processor. It also introduces other models designed specifically for the Cell processor, such as Offload C++, CellVM and Hera-JVM [38, 29, 48, 49].

4.1 Compute Unified Device Architecture (CUDA)

The computational power of GPUs has encouraged software developers to introduce parallel programming tools that allow GPUs to be used for general purpose computing [13, 152, 51, 15]. CUDA is one of these parallel programming models; it aims to simplify programming GPUs for general purpose applications in addition to traditional graphics computation. It was developed by Nvidia, and it supports a number of programming languages such as C/C++, Fortran and OpenCL [71]. This programming model was specifically designed to run on Nvidia's graphics cards. It treats an ordinary CPU as a host and a graphics card as a parallel platform that has a large number of arithmetic execution units. Accordingly, a CUDA program is normally constructed of one host process and one or more accelerator processes. The host process handles the parts of the program that cannot be parallelised, while the accelerator processes perform the parallel computation [152]. CUDA programming is basically based on three main concepts: threads, shared memory and kernels.

• Threads

The units of work in CUDA are organized into a hierarchy. The bottom level of the hierarchy is the thread; a thread is the smallest unit of parallelism in CUDA. The second level is the warp, a group of consecutive threads that execute in parallel. The third level is the block, and each block comprises multiple warps. Threads of one block can synchronize and communicate quickly with each other. The top level of the thread hierarchy is the grid [152]; it is a group of blocks. Once a grid is launched, all threads of that grid must complete execution before any other threads execute. Blocks within one grid cannot synchronize with each other.

• Memory

Memory is also organized into a hierarchical form. The main memory is considered as a top level (global) resource [71]. CUDA threads can read and write to the global memory. The mid level of the memory hierarchy is called shared memory. Each accelerator is designated a small part of the shared memory. The shared memory of a given accelerator is subsequently divided on a block by block basis to be used as a working space for threads within a block. By designating a shared working space for each block, the communication between threads within one block becomes easy and quick.

• Kernels

Kernels are data-parallel functions or subroutines that are used to differentiate parallel code from sequential code. A kernel function generates a grid of thread blocks (a minimal kernel and its launch are sketched below).
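To make these three concepts concrete, the following CUDA C++ sketch doubles the n elements of a vector: the kernel body is executed by one thread per element, the threads are grouped into blocks, and the blocks form the grid launched from the host. The launch configuration and function names are illustrative.

#include <cuda_runtime.h>

__global__ void vecDouble(const float *X, float *Y, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (id < n)
        Y[id] = X[id] + X[id];
}

void doubleOnGPU(const float *X, float *Y, int n)     /* host code */
{
    float *dX, *dY;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));
    cudaMemcpy(dX, X, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;                                 /* threads per block         */
    int blocks  = (n + threads - 1) / threads;         /* blocks making up the grid */
    vecDouble<<<blocks, threads>>>(dX, dY, n);         /* launch the kernel grid    */

    cudaMemcpy(Y, dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX);  cudaFree(dY);
}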


4.1.1 Compilation Process

The source code of CUDA applications is saved in "*.cu" files. A single CUDA program is expected to include both host code and kernel functions. The host code is ordinary C/C++ code that runs on the host processor, while the kernel functions run on the GPU. Once a source file is supplied to the CUDA compiler system, the compilation process includes the following steps:

1. Splitting the host code from the kernel functions and keeping the result in a file with the extension ".cup"; this is the preprocessed source file.

2. Separating the preprocessed code into two files:

• The host code is kept in files with conventional C/C++ extensions such as “*.c”, “*.cc” or “*.cpp”

• The GPUs intermediate code is kept in a “*.gpu” file.

3. Compiling the host code by a host machine’s C/C++ standard compiler

4. Compiling the kernel functions using the NVIDIA device compiler.

5. Linking the compiled code with CUDA run-time libraries.

4.1.2 Program Execution

A typical CUDA program execution is as follows:

• Executing the host code

• If a kernel function is encountered, the execution of that kernel function is dispatched to the accelerator devices by the host processor using a special procedure call.

• The kernel generates a large number of threads that run on the GPU. The generated threads are grouped into a grid.

• Once all the generated threads complete their execution, the corresponding grid terminates, and execution of the host code resumes until another kernel is encountered.

CUDA is recommended for data-intensive applications in which the computation can be split into hundreds or thousands of threads [152]. Programming specialized processors, such as GPU, is usually more complicated than programming CPUs.

4.2 Open Computing Language (OpenCL)

OpenCL is a parallel programming framework based on the C99 standard of the C language [52, 151]. The first stable version, OpenCL 1.1, was released in 2010. It was developed by the Khronos group and other industry companies and institutions [53]. It was designed for developing general purpose applications that run across heterogeneous platforms containing CPUs, DSPs, GPUs and other architectures [151, 148, 53, 32]. An OpenCL program presumes that a heterogeneous platform is built of one host processor and one or more accelerator devices. It also presumes that the data to be processed in parallel is indexed (represented) in an N-dimensional space, where N = 1, 2 or 3. The index space is essential for parallelising a problem: for example, a vector of data can be handled by a one-dimensional (N=1) space, while processing 2D arrays, such as an image, is carried out efficiently on a two-dimensional (N=2) space [52].

4.2.1 Program Structure

An OpenCL program is a collection of host code and OpenCL code [151]. The host code, which runs on the host machine, is responsible for:

• Initialising accelerator devices

• Dispatching kernels to the accelerator devices via queues.

• Transferring data

• Synchronising data

The OpenCL code is a collection of kernels that are executed on the accelerator devices. A kernel implementation, which is similar to a C function, is defined using the keyword kernel, and it is the basic unit of computation. Each kernel execution is called a work-item, and work-items can be grouped into what is called a workgroup [151]. Work-items in one workgroup are executed on one accelerator device and have a shared memory space.

4.2.2 Memory Structure

The memory hierarchy is divided into several levels. The top level is called private memory; it is a designated space for a single work-item. The second level is called local memory; it is allocated per workgroup and can be shared among the kernels in that workgroup [151]. The third level in the memory hierarchy is called global memory, and it refers to an accelerator device's memory. Global memory is visible to all workgroups running on the device it belongs to. The bottom level is called host memory, and it refers to the memory on the CPU. The host memory is visible to all other devices [151]. To distinguish these memory spaces, OpenCL offers three address qualifiers: __global, __local and __private. It also offers the "__constant" qualifier to describe read-only variables [52, 32].

4.2.3 Data Type

The OpenCL language dropped some C99 features, such as variable length arrays, function pointers, recursion and bit fields, but it supports other features that C99 does not, such as vector data types and vector operations [52]. Vector types can be defined by combining type names, such as int and float, with a value n, where n specifies the number of elements per vector. For example, a vector of 4 characters is defined as char4 and a vector of 4 floating-point values as float4. Individual elements can be accessed using the vector variable name, a dot ( . ) and an index. There are two ways to index (address) a particular element in a vector data type: a character set or a numeric index. Using characters, the form is vectorName.xyzw. For example,

double2 v;
v.x = 2.1f;
v.y = 5.2f;
v.z = 8.6f;   // Invalid: vector v has only 2 elements

Individual elements can also be accessed using numeric indices. For example, the following code segment

float16 fvec;
fvec.s0 = 12.1f;   // 1st element
fvec.s1 = 24.0f;
...
fvec.sa = 10.3f;   // 10th element
fvec.sb = 11.5f;
fvec.sf = 16.2f;   // 16th element

declares the variable fvec of 16 floating-point elements, and then initialises one element at a time. In OpenCL, as in Vector Pascal, operations can be performed on vectors, or on a vector and a scalar. For instance,

int16 v1, v2;
int scalar;
v2 = v1 * scalar;

is equivalent to

v2.s0 = v1.s0 * scalar;
v2.s1 = v1.s1 * scalar;
...
v2.sf = v1.sf * scalar;

4.2.4 Built-in Functions

The OpenCL language provides a set of specialized built-in functions for query operations, math operations, work-item operations, image access and synchronization [52]. Figure 4.1 shows common functions:

4.2.5 Program Execution

To illustrate OpenCL program execution, we first present sequential and parallel versions of code for doubling the elements of a vector, and then explain the execution process. Figure 4.2 shows a scalar implementation in C for doubling the n elements of vector X and storing the result in vector Y. Figure 4.3 shows parallel OpenCL code for the same purpose.

The C function in Figure 4.2 is a sequential implementation that iterates, using a for loop, through the n elements of the received vector X, doubles one element at a time and saves the result in the vector Y.

The code given in Figure 4.3 is a shorter OpenCL implementation of a kernel. For the sake of simplicity, it does not include details of how to set up the parallel environment, such as querying for platform information, creating a context, creating memory objects, etc. (a sketch of these host-side steps is given after the kernel discussion below).


// Returns a unique global ID for a work-item in dimension idx
size_t get_global_id(idx);

// Returns the number of work-items (kernels) in dimension idx
size_t get_global_size(idx);

// Sets a barrier to block a work-item until the others in the same group have reached it
barrier(flag);

// Takes an image and a coordinate and returns a vector of colour values
float4 read_imagef(image, coord);

// Writes a colour value to coordinate (coord.x, coord.y)
void write_imagef(image, coord, float color);

// Returns the width of an image in pixels
int get_image_width(image);

Figure 4.1: OpenCL Functions

void vecDouble(const int n, float *X, float *Y) {
    for (int i = 0; i < n; i++)
        Y[i] = X[i] + X[i];
}

Figure 4.2: Scalar C Function

kernel void vecDouble(global float *X, global float *Y) {
    int id = get_global_id(0);
    Y[id] = X[id] + X[id];   // run over n work-items
}

Figure 4.3: Parallel OpenCL Kernel


The function header defines it as a kernel, which implies that this code (function) will be dispatched on the available accelerators but will work on different data. The accelerators needed for processing all the data should be predefined. The kernel receives two global pointers of type float: the pointer X refers to the input data, and Y refers to a memory location on the CPU into which the result is stored. When this kernel is executed on an accelerator device, the first line in the kernel body requests a global ID to index the data that will be processed by that accelerator. The second line then uses the ID to perform the requested operation on the id-th element of vector X and store the result in the id-th element of vector Y.
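For completeness, the host-side steps that Figure 4.3 leaves out might look roughly as follows under OpenCL 1.1. The source string src is assumed to hold the kernel text of Figure 4.3, and all error checking is omitted.

#include <CL/cl.h>

void run_vecDouble(const char *src, const float *X, float *Y, size_t n)
{
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);       /* create a context    */
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);            /* and a command queue */

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);   /* build the kernel    */
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vecDouble", NULL);

    cl_mem bX = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), (void *)X, NULL);
    cl_mem bY = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);

    clSetKernelArg(k, 0, sizeof(bX), &bX);
    clSetKernelArg(k, 1, sizeof(bY), &bY);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);          /* n work-items        */
    clEnqueueReadBuffer(q, bY, CL_TRUE, 0, n * sizeof(float), Y, 0, NULL, NULL);
}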

OpenCL supports a number of architectures such as AMD, IBM, Intel and Nvidia. It supports data and task parallel computing. A task is executed as a single kernel [53]. It also supports run-time compilation which allows applications to utilize different compute devices at run time [52]. OpenCL, however, requires users to learn how to develop parallel applications that are efficient and can perform well on different heterogeneous platforms [53]. To overcome the programming difficulties that are still associated with OpenCL, a new source-to-source tool called Swan has been recently introduced by Imperial College London. Swan was designed to port CUDA to OpenCL by converting existing CUDA code to OpenCL code [50]. This tool takes advantage of the OpenCL support for broad architectures by helping to run existing CUDA applications on architectures other than Nvidia.

4.3 Programming the Cell BE Architecture

This section presents a number of parallel programming paradigms that have been developed recently for programming the Cell processor. Some of these models have been designed to exploit data-level parallelism. Data-parallel models, such as OpenMP, SieveC++ and Offload, basically focus on data parallel code such as loop structures. Other models have been developed for exploiting thread-level (task-level) parallelism. Task-based models, such as CellVM and Hera-JVM, have been designed for managing multi-threaded applications.

In the following discussion, we start with OpenMP as a widely used API for programming the Cell processor and other architectures. After that we talk about two thread-level based models that were introduced two years ago: Hera-JVM and CellVM. We shall also talk about SieveC++, a source-to-source compiler, and finally look at the Offload C++ and Threading Building Blocks (TBB) tools that have been developed very recently to write programs for the Cell processor.


4.3.1 Open Multi-Processing (OpenMP)

OpenMP is an API that has been designed to support parallel programming of general purpose applications on architectures ranging from personal computers to supercomputers [30, 27]. It supports the C/C++ and Fortran languages, but the discussion here is based on C++ code. Explicit-based programming models usually require users to set up resources for a parallel environment and to instruct the compiler on the parts of the code to parallelize. OpenMP offers programmers the tools to explicitly manage multi-threaded parallelism on shared-memory architectures [30]. These tools include a set of environment variables, compiler directives, and library routines:

• Environment variables

To control the execution of parallel code, OpenMP provides the following four environment variables:

– OMP_NUM_THREADS: sets the maximum number of threads.

– OMP_SCHEDULE: determines how the iterations of a loop are scheduled (static, dynamic, etc.).

– OMP_DYNAMIC: enables or disables dynamic adjustment of the number of threads available for executing parallel sections.

– OMP_NESTED: enables or disables nested parallelism (one parallel section inside another).

• Data environment management

By default all variables within OpenMP code are visible to all threads [30]. Users, however, can manage variables in several ways using the following data constructs,

– Shared : implies that data within a parallel section is shared among threads.

– Private : means that data within a parallel section is private to each thread. That is, each thread has a local copy and uses it as a temporary variable.

– Default : allows specifying the default data scoping within a parallel region.

– flush : can be used to restore the value of a given variable from a register into memory so that it can be used outside the parallel region.

• Preprocessor Directives


OpenMP offers several types of directives. Different languages use different preprocessor directives. For example, in C/C++ a directive must begin with the keyword #pragma omp, while in Fortran directives must start with !$OMP, C$OMP, or *$OMP depending on the version of Fortran [27, 30]. We use C++ code to illustrate the use of the following directives:

– Parallel Constructs

C++ provides three different OpenMP constructs to define a parallel region. They are as follows:

* A for directive: it splits up loop iterations among threads, but it does not create threads. It is practical for implementing data parallelism.

* A parallel directive: it splits the current thread into a new group of threads which stay active until all the created threads have completed and merged back into one thread.

* A parallel for directive: it is a two-command directive, parallel and for. The first command, as mentioned above, creates a new group of threads, and the second splits the loop iterations across that group so the threads work on different parts of the data.

– Sections Constructs

It is used to run consecutive code blocks on different threads. It works well for task parallelism.

– Master Construct

It specifies a code block that will be executed by the master thread only.

– Single construct:

It can be used to specify a code segment that will be executed by only one thread. A barrier is implied at the end: the other threads skip the single section and wait at that barrier.

• Synchronization constructs

– critical section : the defined section is executed by all threads, but only by one at a time. It is usually used to protect shared data from race conditions.

– ordered : the enclosed code block is executed in order.


– Barrier : forces each thread of a given section to wait until all threads of that section reach this point.

– nowait : permits threads that have completed to proceed without waiting.

• Scheduling clauses

Schedule clauses are very useful for for-loop constructs because they divide the work of the loop iterations among different threads. For this purpose, OpenMP offers the following three scheduling mechanisms:

– Static : iterations are divided equally among threads before execution starts.

– dynamic : iterations are divided into small chunks. Once a thread finishes its chunk, it requests another from the ones that are left.

– Guided : chunks of consecutive iterations are dynamically allocated to each thread. With this mechanism, the chunk size is decreased exponentially each time. This approach suits loops that do not have constant cost per iteration such as in Mandelbrot applications.

• Library routines

They can be used to obtain information on threads, such as a thread's identifier or the total number of threads; a short example follows this list.
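As a small illustrative example (not taken from the thesis), the following C++ fragment combines a parallel for directive, a schedule clause and two of the library routines mentioned above:

#include <omp.h>
#include <cstdio>

int main() {
    const int N = 16;
    double Y[N];
    // dynamic scheduling hands out chunks of 4 iterations at a time
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++) {
        Y[i] = 2.0 * i;
        // library routines: which thread ran this iteration, and how many exist
        std::printf("iteration %d on thread %d of %d\n",
                    i, omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}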

4.3.1.1 OpenMP Program Execution

The following is a short description of how a program with OpenMP directives is executed:

1. The program starts as a single thread. It is called a master thread and its ID is “0”.

2. The master thread executes sequentially.

3. During the execution of the master thread, if it encounters a parallel region, it then

a) Creates a number of threads, each with a non-zero integer ID.

b) Forks the region to be parallelized among the created threads.


Figure 4.4: OpenMP Program Execution (fork/join model: the master thread forks a team of threads for each parallel part, and the threads join back into the master thread when the part completes)

c) Once the execution of the parallelized region is finished, the threads then are synchronized, terminated, and joined back into the main thread. See Figure 4.4.

4. Resume the master thread execution

5. Go to step 3 if it encounters another parallel region.

4.3.1.2 OpenMP on Cell

OpenMP provides preprocessor directives in Fortran and C/C++ to control compilation for the Cell processor. The OpenMP compiler for Cell is a single-source compiler and handles data partitioning, communication and synchronization automatically [79]. The compilation process is carried out using two parts: a source-to-source front-end compiler or translator and existing PPE and SPE compilers. Given a C program, the OpenMP compiler first generates dual source code: PPE source code and SPE source code. The PPE code includes all sequential code which is handled by the master thread. The master code is compiled by the PPE compiler. The other source regions, which are supposed to be defined as parallel, are translated into SPE code which is compiled using an SPE compiler [27]. To illustrate the compilation process, we present the following simple example:

#pragma omp parallel for
for (int i = 0; i < N; i++)
    Y[i] = X[i] + X[i];

The code segment shown above is C source code that defines a parallel for loop. The loop goes over vector X of length N, doubles each element and stores the results in vector Y.


The first line of code defines a parallel for directive which generally results in creating a new group of threads to execute the corresponding task. When the compiler encounters this directive, it offloads the omp section onto the available SPEs by creating a new function and cloning the code on the SPEs. At run time, each SPE function uses DMA transfers to grab blocks of data for computation, and when the task of a cloned SPE function (SPE thread) is finished, that SPE thread notifies the PPE of completion via its mailbox or a signal.

Generally, OpenMP assists in relaxing some of the complexities, such as data partitioning and communication, that are associated with the development of parallel applications. These features, as well as being a single-source compiler, have made it a widely used API for multi-core architectures. However, using OpenMP directives may result in performance degradation if the selected region is not worth parallelizing. The degradation in performance is due to the gained speedup not paying off the overheads of thread creation, shared-memory activities and synchronization. Thus, the development of an OpenMP program often requires some skill and effort to get the best performance. OpenMP also does not support parallel I/O operations such as writing to or reading from a shared file.

4.3.2 Sieve C++

Sieve C++ is a parallel programming system that was released by Codeplay in 2006 [18]. It is another programming model that aims to simplify developing parallel applications for multi-core architectures. The main concept of Sieve C++ is to split sieve code into three components: reading, computing and writing [22]. This strategy allows parallelising the computation part safely as memory access operations are separated from the computation. The Sieve system is an extension to a C++ compiler that includes a run-time system for processing management [18]. The system is basically a source-to-source compiler that takes in C++ source code that is wrapped inside a sieve block, distributes work across multiple processors and, most importantly, instructs the compiler to delay side-effects within a sieve block until the end.

4.3.2.1 SieveC++ Code Structure

Sieve uses a wrapping technique to enfold code inside blocks. A sieve block must be annotated with the keyword sieve, and it could be a function or a region of code that is enclosed within braces. To define a function as a sieve block, the keyword sieve must be placed immediately before the semi-colon symbol ( ; ) which marks the end of the statement.


returnType functionName ( arguments list ) sieve;

Figure 4.5: Sieve Function Prototype

sieve {   // The start of a sieve block
    ......
}         // The end of a sieve block

Figure 4.6: Sieve Block in C

The general forms of a sieve function and a sieve block are shown in Figures 4.5 and 4.6 respectively.

In the Sieve C++ system, a sieve block can be split into multiple processes. If a dependency exists between local variables inside a sieve block, the sieve compiler will implicitly split the code into chunks which are processed in sequential fashion [18]. The compiler, however, can be explicitly instructed, using a splithere statement, to split up the code within a sieve block into independent processes [38].

The sieve system provides program developers with a number of special library classes. We introduce here two examples of these classes: IntSum and IntIterator. IntSum is an accumulator class. Accumulator classes can be used to split an operation across multiple processors [18]. The IntSum class, for example, indicates that the sum operation is carried out in parallel, and thus the Sieve compiler splits any variable of type IntSum into multiple variables, each evaluated on a different processor. The IntIterator class can be used to split an iterator. It splits the iterator control object or "index" into multiple indices and sends them to different processors. These classes can also be applied to other data types.

4.3.2.2 Sieve Side-effect Rule

Side-effects in a computing environment mean modifying or updating variables or states of shared types such as global variables, static variables and pointers [18]. It is a fundamental concept for handling the dependencies that often appear with automatic parallelisation. The SieveC++ compiler is designed to delay all side-effects within a sieve block until the end of the block.


void vecDouble(const int n, float *X, float *Y) {
    sieve {
        for (int i = 0; i < n; i++)
            Y[i] = X[i] + X[i];
    }   // The assignment to vector Y delayed till here
    ....
}

Figure 4.7: SieveC++ Implementation

To illustrate how the dependency can be resolved by the side-effect deferring technique, we present the same example given in Figure 4.2 to double the elements of vector X and store the result in vector Y, but this time the loop is defined as a sieve block, as shown in Figure 4.7, to be parallelised by the SieveC++ compiler.

The for loop structure in Figure 4.7 is declared as a sieve block, and therefore the addition operation will be carried out in parallel on separate processors and the result assigned to vector Y. Under the sieve system, the compiler will delay the assignment to vector Y till the end of the loop to avoid any dependencies that might appear as a result of the pointers X and Y pointing to memory spaces that overlap.

This side-effect approach gives the compiler the flexibility to reorder the execution of the statements within the block and also eases the task of the compiler to partition the sieve code [18, 38]. As a result of that, the compiler is required to do dependence analysis only on local memory locations. This approach also provides more memory space to be used as data outside a sieve block is separated from the inside data and will not be processed until the end of the sieve.

4.3.2.3 Compilation Process

The Sieve C++ system handles standard code differently from sieve code. Code that is outside sieve blocks is compiled as a normal sequential code while the compilation process of sieve code goes through several steps:

• The compiler starts with dependency analysis on each sieve block to determine the split points where there is no dependency.

• If the compiler comes across any dependencies within a sieve block, it will then report to the user where and on which variable a dependency exists [18]. Such


int Reduction(int *X, int size) {
    int Result;
    sieve {
        IntSum subTotal(0) splits Result;
        IntIterator index;
        for (index = 0; index < size; index++)
            subTotal += X[index];   // loop body reconstructed from the description below
    }
    return Result;
}

Figure 4.8: Sieve C++ Code

information is very important for programmers in order to make the proper changes for better parallelization.

• Splits up the code within a sieve implicitly into independent processes.

• Once the splitting is done, the compiler then distributes the processes across the available processors to run in parallel.

4.3.2.4 Sieve Block Execution

On entering a sieve block, the sieve run-time system starts distributing processes among multiple cores, and once a process execution reaches an exit point, it returns to the run-time system to decide which process to run next [38]. To illustrate the use of iterator and accumulator classes as well as the execution process, we present, as shown in Figure 4.8, a simple reduction (addition) function.

The first two lines in Figure 4.8 are standard C code to define a function heading and to declare an integer variable called Result. Moving inside the sieve block, the first statement declares the subTotal variable as an object of type IntSum initialised with 0. This implies that each process or thread will maintain in its subTotal variable the sum of some elements in vector X. At the end of the execution of the sieve block, the sums in the subTotal variables are assembled in the Result variable. The elements of vector X that each process is supposed to sum are determined by the next statement in the block, which declares the variable index as an instance of class IntIterator. This results in splitting up the index variable on the available processes. If p processors are used, then the number of elements (elem) that each processor sums together is elem = size/p, and

a processor N, where 0 ≤ N < p, starts iterating from N ∗ elem and stops when reaching (N + 1) ∗ elem.

4.3.2.5 Sieve C++ on Cell

The SieveC++ system's main concept, which is based on separating memory access from computation, suits the Cell BE architecture because it has multiple levels of memory structure [22]. The Sieve-Cell system, which was initiated as an experimental attempt, works on the Cell as follows:

• A sieve block can be used to determine the part of the code that is to be executed on the SPEs.

• Variables outside a sieve block are located in the PPE memory.

• Local sieve block variables are located in the SPEs' local storage.

• Reading global memory within a sieve block triggers a DMA transfer to move data into an SPE's local storage.

• Writes to global memory from a sieve block are kept in the SPEs' local storage to avoid any side effects; they are flushed to memory at the end of the block.

With the Sieve-Cell system, programmers are required to write a single source code that runs on both the PPE and SPEs. The compiler processes Sieve code and produces multiple ANSI C files. These files can then be compiled for the PPE and SPE using a third-party C++ compiler and linked using the Sieve run-time libraries to produce an executable file that runs on the Cell. The following steps describe how sieve code is executed:

• The PPE loads each SPE with a program. This program includes a loop that runs continually and code for handling sieve code (aka work unit) execution [22].

• The SPEs use DMA transfers to request work units from the PPE through their outbound interrupt mailboxes.

• The PPE responds through the SPEs' inbound mailboxes by sending pointers to the work units that are to be executed on the SPEs.

• Each SPE then uses the received pointer to fetch the corresponding work unit.

• The PPE instructs the SPEs to start the execution of the work units and waits for them to finish.


offload returnType functionName ( arguments list );

Figure 4.9: Offload Function Prototype

• Side-effects are handled by a specific SPE function. In case an SPE’s local storage reaches a certain limit, that function then uses the PPE memory as a temporary space [22].

• When the SPEs acknowledge the work completion, the PPE then resumes either to issue new work units to the SPEs or to continue the execution of the standard code.

• On the completion of the SPEs' work, the PPE must ensure that the delayed side-effects are flushed in the same order.

4.3.3 Offload C++

Offload C++ was introduced in October 2009 as an extension to C++ for parallel programming, and it too was developed by Codeplay [60, 149]. It is based on the offload technology which focuses on extracting parts of a master code to run on accelerator cores rather than focusing on parallelizing code [149, 29]. The Offload model was designed for programming heterogeneous multiple core processors in C++, and the first version was developed specifically for writing game applications on PlayStation3. In 2010, Codeplay released the second version of the Offload C++ suite for programming the Cell BE processor under Linux [149, 60]. The Offload suite includes a multi-core run-time library, compilers and a debugger.

The Offload model, like Sieve C++, is based on wrapping techniques to determine the code to be offloaded. An offload block must be annotated with the keyword offload, and it could be a function or a region of code that is enclosed within braces. To define a function as an offload block, the keyword offload must be placed before the function name; see Figure 4.9.

The Offload compiler offloads annotated parts in large C++ programs to the Cell's SPEs using threads [29]. Besides, any function that is called from within an offload block is compiled for the SPEs, and data defined inside an offload block are located in the SPEs' local storage. However, data movement between the PPE main memory and LSs is automatically handled by the compiler. Code inside an offload block runs in sequential fashion as a thread on an SPE, and thus multiple offload threads can be executed in parallel. In contrast, master code that is not wrapped in an offload construct runs on the PPE.


int *X;                  // Outer pointer
int main() {
    offload {
        int *Y;          // Local (inner) pointer
        int Z = *X;      // Requests DMA
        X = Y;           // Error: Y is not outer
        foo(X);          // Valid function call
        foo(&Z);         // Error: &Z is not outer
    }
}

Figure 4.10: Offload C++ Code

A C++ program that includes offload constructs is compiled to intermediate C code for the PPE and SPEs. The code is then compiled using GCC compilers for the PPE and SPE. In what follows, we briefly describe the main design issues in Offload: offload scopes, dereferencing pointers and the duplication technique.

Offload scopes refer to offload blocks and offload functions [29]. Non-offload variables, such as global variables, can be accessed by offload scopes using replicating techniques. For example, Offload allows an offload scope to use global variables by copying the global variables to local variables inside the offload thread using the same names. The copied versions then can be used by an offload thread to refer to the global variables. The second important issue is how to distinguish between host memory pointers and local memory pointers.

Offload provides the outer qualifier to distinguish between a host memory pointer and a local memory pointer. The outer qualifier can be used to refer to a host memory pointer. A pointer that is not inside an offload block is considered outer by default, and pointers to local memory are considered non-outer (inner) pointers. Thus assigning an outer pointer to an inner pointer, or the other way around, is invalid. In the implementation of Offload on the Cell processor, outer pointers cannot point to the SPEs' local storage, and inner pointers cannot refer to the PPE memory. Nevertheless, if an offload scope dereferences an outer pointer (host memory), this is resolved by transferring data from host to local memory. On the Cell processor, for instance, the data movement is handled automatically by the compiler using DMA. The following example gives an idea of offload scopes and outer and non-outer pointers:

The code given in Figure 4.10 includes one global pointer X which is by default an outer pointer and resides in the PPE memory. The first statement in the offload block declares one local (inner) pointer Y of type integer that points to a local memory location. The next statement declares a local integer variable Z. In the same statement an outer pointer was used to assign the value in the PPE memory location X into the local variable Z. Since Z was declared inside an offload scope, it resides in local memory, and therefore the data must be transferred from the PPE into the SPE's local storage using DMA. This example also shows two instances of invalid assignments in which the code tries to assign an inner pointer to an outer pointer or pass an inner pointer to another function.

The third technique in Offload is call-graph duplication [60]. It is used to provide multiple compilation units, which are sometimes needed when non-offload functions are involved. If a program has only offload functions the compiler can easily compile them for the SPEs. However, if a non-offload function is called from an offload scope, the compiler is then required to overload the non-offload function. The new version of the overloaded function is identical except that it carries the offload qualifier.
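A minimal sketch of this duplication is given below; it assumes the Codeplay Offload compiler (the offload keyword is not standard C++), and the function names are purely illustrative.

float scale(float v) { return v * 2.0f; }   // an ordinary, non-offload function

int main() {
    float x = 1.0f, r;
    offload {
        // scale() is called from an offload scope, so the compiler generates
        // a second, SPE-side version of it carrying the offload qualifier,
        // in addition to the original PPE version.
        r = scale(x);
    }
    return 0;
}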

Offload shows some performance improvement on the Cell accelerator cores, yet it does not completely hide the underlying details of the architecture. The duplication technique may produce a serious problem as the Cell accelerators have very small storage space; for instance, if a standard function is called several times by offload scopes, this approach may dramatically increase the code size [60].

4.3.4 Hera-JVM

Hera-JVM is a run-time system that supports migration of Java application threads between heterogeneous multi-core architectures [133]. It does not focus on parallelizing code like the proposed VSM does; instead it focuses on managing threads between different core types by treating the heterogeneous cores as a homogeneous multi-threaded virtual machine [49]. Hera-JVM was implemented for the Cell processor and was presented as a doctoral thesis at the University of Glasgow in 2010.

Hera-JVM is built upon a Java Research Virtual Machine (Jikes RVM) which supports PowerPC architectures and allows Hera-JVM to run on the PPE without any modifications [133]. The intent of Hera-JVM's design is to hide some aspects of the processor's heterogeneity and enable Java programmers to use heterogeneous cores without the need for deep familiarity with the processor's design [49]. Its run-time system depends basically on annotations to gather information on code behaviour, and then uses this information to map Java application threads to the proper underlying cores. To trace a program's behaviour, Hera-JVM provides a number of tagging mechanisms and behaviour characteristics.

It suggested three different tagging mechanisms to describe expected behaviors of source code and to provide the information about the code behavior. These three mechanisms are: explicit annotations, source code analysis tools and run-time monitoring. An explicit

code annotation is the basic tagging approach, one that is supported by a number of programming languages [49]. Source code analysis tools, on the other hand, are automatic approaches to track an application's behaviour and tag the proper code segments [133]. The third tagging mechanism is a dynamic approach that monitors code behaviour from different perspectives at run-time.

It also provides three mechanisms to perceive the source code behaviors at different stages: processing behaviour, thread communication behaviour and execution behaviour. The first class provides hints for the Hera-JVM run-time system to make an efficient use of similar processing units based on the capabilities of the similar units on different core types. For example, the performance of the floating-point unit on the Cell processor varies between the PPE and the SPEs. Hera-JVM offers the following tags to characterise the required processing for the main functional units [133]:

• IntegerCode : marks code segments that require intensive use of the fixed-point unit.

• FloatingPointCode : marks code segments that require use of the floating-point unit.

• DataAccessCode : marks code segments that require intensive memory access.

However, if the performance of the processing units is the same on the different core types, these hints are not significant and the Hera-JVM run-time system then ignores them.

Inter-thread communication is the other aspect that Hera-JVM focuses on. Threads of the same process often work on shared data and require communication with each other, and therefore the information on shared data is important for efficient communication between threads. To achieve that Hera-JVM looks at threads that frequently communicate with each other and then groups them together using the following tag [133]:

@ThreadTeam(name="")

Hera-JVM also looks at data locality, the distance between the cores accessing the data and the possibility of using efficient communication mechanisms such as scratchpad memory transfers.

The third activity that Hera-JVM traces is the behaviour of program execution. Knowing the expected behaviour of an executing thread helps the run-time system to choose the proper core type on which the thread should be executed. The following five tags are offered by Hera-JVM to characterise program execution behaviour:


• @SequentialAccessBehaviour: Tags code that requires access to consecutive memory locations such as arrays.

• @RandomAccessBehaviour: To tag code that requires non-sequential memory ac- cess.

• @LargeWorkingSet: To specify code that is expected to work on a considerable chunk of data and also specify the expected size of data. This information assists the run-time system to determine the proper core that has enough cache [133].

• @IoAccessBehaviour: To determine I/O code segments.

• @ExceptionsLikely: Tags code that is expected to generate many exceptions. This tag is useful in choosing the core that is less costly than the other cores in handling exceptions.

To determine the influence of each behaviour on a program's performance, each behaviour characteristic should be associated with its cost on the different core types. The cost information for the different types of cores can then help the Hera-JVM run-time system to automatically determine the total cost of a given behaviour on the available heterogeneous resources and choose the core type that would be most efficient for executing a given piece of code.

The Hera-JVM run-time system relies on behaviour characteristic annotations to gather information on a program's execution, and it then uses this information to migrate threads to the proper underlying cores. This means that the Hera-JVM approach does not consider parallelizing code or data, but it provides techniques for managing multi-threaded execution on heterogeneous architectures [49]. The Hera-JVM run-time system, unlike VSM, hides only some aspects of the processor's heterogeneity and still requires programmer intervention to add annotations.

4.3.5 CellVM

CellVM is a Java virtual machine that can also be used to offload Java threads on a heterogeneous multi-core architecture such as the Cell. It aims to exploit the performance potential offered by the Cell processor and to improve programmers' productivity. CellVM basically emulates, similarly to Hera-JVM, a homogeneous shared-memory multi-core machine to hide the heterogeneity of the Cell's underlying hardware and to offer a high level of abstraction for offloading individual Java threads onto the SPEs [48]. It was developed to overcome the barriers that prevent standard Java VMs from incorporating the SPEs for executing Java instructions. CellVM, which was first introduced in 2008, is an extension of a

compact Java VM called JamVM. The main feature of JamVM is its size. Compared with most other JVMs, JamVM is very small, which makes CellVM suit the challenges posed by the limited space of the SPEs' local storage [48]. CellVM has been implemented as a prototype system that can be used to execute in parallel one thread per SPE [48]. This prototype system consists of two collaborating Java VM interpreters: ShellVM and CoreVM [48].

ShellVM runs on the PPE to maintain the overall control of the machine resources. It is responsible for handling and executing built-in routines, such as "Math" routines. Built-in routines should not be executed on the SPEs because, first, they usually require a relatively large memory space while the SPEs' local storage is very limited [48]. Secondly, built-in routines often use the Java heap, which resides in the main memory, and thus executing built-in routines will most likely require access to the Java heap space, and this would be very costly. ShellVM is also responsible for code preparation to save switching between the PPE and SPEs. It prepares the bytecode of any routine before calling that routine [48]. The other task of the ShellVM is resolving references that are not known at compile time. The preparation process, which can be handled in the preparation stage, deals only with dynamic references that are not altered during the entire execution of the application. The current implementation of CellVM has solved this issue by adding 8 bytes to each bytecode to store dynamic information. This approach is expected to reduce the number of DMA transfers and the switching between threads.

The other interpreter, CoreVM, can be seen as a virtual Java code execution unit on an SPE. The currently developed CoreVM is expected to execute most Java instructions, but if it finds an instruction that it cannot execute on the SPEs, CoreVM then hands the instruction to ShellVM [48]. For performance considerations, the CoreVM implementation also excludes a subset of the JamVM functionality, such as managing the Java heap and imported functions that are written in other languages. This management is left to the PPE to handle for the following reasons: first, the Java heap space, which is often used for allocating new objects such as arrays and sometimes by external functions, resides in main memory, and therefore the cost of accessing the main memory from the PPE is less than accessing it from an SPE, which requires using DMA transfers [48]. Secondly, executing imported functions on an SPE requires offloading the function's code into the SPE local storage, and this means reducing the available space for manipulating data on the SPE.

A program execution under CellVM starts with the ShellVM setting up the space required for imported functions and data such as the Java heap. It then prepares the required information, such as data structures and code, for launching the CoreVM. Once this information is cached in the local stores on the SPEs, the CoreVM then starts executing the code, which is expected to continue unless the CoreVM interpreter encounters code, such as

imported functions, that must be executed on the PPE. If CoreVM needs such assistance from the PPE, the procedure to switch the instruction's execution from CoreVM to ShellVM is as follows: first, the status of CoreVM is updated on the PPE using a DMA transfer. Once the DMA transfer is completed, the ShellVM postpones the CoreVM and starts executing the requested instructions. After ShellVM finishes executing the requested instructions, it acknowledges the completion to CoreVM, and CoreVM resumes execution.

The CellVM model can only be used for Java multithreaded applications because it does not have the capability to divide the workload of a single thread. The other constraint of the current implementation is the number of threads that can be executed per SPE. For example, if a multithreaded application has more threads than the available SPEs, the additional threads are executed on the PPE. This implies that the performance potential of the Cell processor can be reached only if the number of an application's threads is equal to the number of available SPEs. Handling imported functions also degrades the performance of CellVM because they often require switching between CoreVM and ShellVM. The CellVM design also does not completely hide the underlying details of the Cell processor, and hence programmers must be very familiar with the Cell processor's architecture in order to exploit its potential performance.

4.3.6 Threading Building Blocks

We introduce here another approach for programming the Cell processor. This approach combines heterogeneous and homogeneous parallel programming models. The first model is the Offload C++, and it has been already introduced in this chapter. The second model is the Intel Threading Building Blocks (TBB). This section first describes the Intel TBB approach and then explains how these two models were integrated to port C++ on the Cell processor.

The Intel TBB is a C++ run-time library that was introduced to simplify the development of parallel programs on Intel homogeneous platforms [153]. It provides high-level thread management through the run-time libraries to liberate programmers from managing threads explicitly, which often requires thread creation and explicitly mapping the individual sub-parts of a given problem onto threads efficiently. These activities in TBB are handled by a task scheduler [153]. The TBB library also provides other low-level features such as locks and atomic operations [20], and it makes use of the template feature in C++ to provide a number of generic parallel constructs such as parallel_for, parallel_reduce and parallel_while.


The TBB task scheduler automatically determines the number of threads and arranges their execution. A thread in TBB must initialise the task scheduler by creating an object of type tbb::task_scheduler_init, as shown in the following code:

using namespace tbb;
int main() {
    task_scheduler_init initThread;
    ...
    return 0;
}

The above code declares the object initThread without any argument. The default constructor of tbb::task_scheduler_init does the initialisation. The constructor can take an optional parameter to determine the number of threads required, but it is recommended not to specify the number of threads. At the end of the program, the destructor is called automatically to terminate the object initThread [153, 20]. However, a thread may have multiple task scheduler objects at a time, and since the overheads of starting up a task scheduler and shutting it down are high, it is advisable that the number of created task scheduler objects be kept to a minimum [153].

The parallel_for is a template function that can be used to parallelise tasks within a for loop structure. It basically chops an iteration space into blocks, and then schedules these blocks to run on different threads [153]. It requires two objects:

• An object of type range that determines the iteration space. TBB offers two range types: blocked_range for single-dimensional spaces and blocked_range2D for two-dimensional spaces.

• A body object is a form of function object that includes a modified version of the original serial loop that runs on sub-ranges determined by the range type. The function object should have a copy constructor to copy the code for each thread, and its member function operator() is overloaded to accommodate the modified original serial loop.

Figure 4.12 shows a simple program that adds each element in vector X to the corresponding element in vector Y and stores the sum in the corresponding element in vector Z. The computation is performed in parallel using the TBB library. The program starts by declaring three vectors: X, Y and Z. The size of these vectors is defined in the first line of the program. It then instantiates a task scheduler object initThread which manages the scheduling of tasks on threads (available cores). The third statement in the body of the main function calls the


class vecAdd {
    float *v1, *v2, *v3;
public:
    // constructor copies the arguments into local space
    vecAdd(float *p1, float *p2, float *p3) : v1(p1), v2(p2), v3(p3) {}
    // Overloaded operator() to add elements of two vectors
    void operator()(const blocked_range<size_t> &r) const {
        for (size_t i = r.begin(); i != r.end(); i++)
            v3[i] = v1[i] + v2[i];
    }
};

Figure 4.11: C++ Class

const size_t Size = 10000;
int main() {
    float X[Size], Y[Size], Z[Size];
    task_scheduler_init initThread;
    parallel_for(blocked_range<size_t>(0, Size), vecAdd(X, Y, Z));
    return 0;
}

Figure 4.12: C++ Code Segment Using TBB libraries


parallel_for function, which takes two arguments: the first is a range object which determines the iteration space. In the example given in Figure 4.12, the iteration space runs from 0 up to Size. The second argument is a function object. It is the constructor of the class vecAdd which is defined in Figure 4.11. The constructor stores the received pointers to X, Y and Z as private members. The task of the parallel_for here is to create objects of type vecAdd and pass the sub-ranges on to each thread. The vecAdd objects are then scheduled and executed in parallel [20]. Notice here that neither the tasks nor the number of threads were specified; these parameters are set automatically by the task scheduler.

4.3.6.1 TBB on the Cell

Intel TBB does not primarily support heterogeneous architectures such as the Cell BE, but it has been combined with Codeplay's Offload C++ to allow programs that are parallelised using TBB to run on the Cell's accelerator cores [28]. This work was initiated as an experimental attempt to use TBB on the Cell processor. TBB and Offload are two complementary models. Offload C++ provides, as mentioned before, the basic offload construct to wrap regions of code in a given application. The code inside offload blocks is executed on the Cell's accelerators, and any code which resides outside the offload blocks is executed on the PPE. Recall also that in Offload C++ data transfers between the host and the accelerators and code duplication are handled automatically while thread management is handled manually. On the other hand, the Intel TBB offers a task scheduler which automatically manages threads. These complementary features were combined to allow code to execute across the Cell's SPEs in a work which was presented as an experimental study in 2010 [28].

The combination approach of Offload C++ and Intel TBB focuses on offloading the parallel loop constructs of TBB in order to port TBB programs to the Cell processor. The implementation, which consists of several template classes, aims to distribute TBB loop iterations across the PPE and SPEs [28]. The distribution process is carried out by a template function that is injected inside the TBB loop constructs, such as parallel_for and parallel_reduce. To illustrate this, we briefly describe how the work given in [28] implemented the parallel_for construct using Offload C++.

The parallel_for construct has two arguments (objects): an iteration space (range) and a function or code (body) that runs on sub-ranges determined by the first object. The function proposed to be injected inside the parallel_for also has the same arguments: range and body. Therefore, by injecting a function inside the parallel_for construct, the injected function can be called with the range and the body. The responsibilities of the injected function are as follows:


• Get the number of available SPEs; NUM_SPES.

• Get the start and end of the iteration space.

• Determine the block size by dividing the iteration space equally among NUM_SPES plus one blocks, the extra block being run on the PPE.

• Iterate on the SPEs to execute NUM_SPES sub-ranges:

– Compute the beginning and the end of the sub-range for the ith SPE.

– Call the body from inside an offload block.

Notice here that the second step will trigger Offload's call-graph duplication: because the function (or passed body) is a non-offload function and it is called from an offload scope, the compiler will overload the non-offload function for the SPE. The overloaded version is identical to the original code, but with the offload qualifier added.

• Execute one sub-range on the PPE, starting from the iteration at which the last SPE's sub-range ends.

Thus combining the Intel TBB and Offload C++ as shown above allows parallelising the parallel_for construct on the Cell processor.
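To make the splitting arithmetic concrete, the sketch below reproduces only the sub-range computation described above; the offload dispatch itself is indicated by a comment, and the helper name, the Range/Body interfaces and the SPE count are assumptions made for illustration.

#include <cstddef>

template <typename Range, typename Body>
void split_for_offload(const Range &range, const Body &body, int num_spes) {
    std::size_t begin = range.begin();
    std::size_t end   = range.end();
    std::size_t block = (end - begin) / (num_spes + 1);   // one extra block kept for the PPE

    for (int i = 0; i < num_spes; ++i) {
        std::size_t lo = begin + i * block;
        std::size_t hi = lo + block;
        // In the combined system this call sits inside an offload block, so the
        // body is duplicated for the SPE and runs there over [lo, hi).
        body(Range(lo, hi));
    }
    // The final sub-range (including any remainder) runs on the PPE.
    body(Range(begin + num_spes * block, end));
}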

5 Virtual SIMD Machine

The work here represents the central objective of this dissertation. The basis of this work is a new parallelisation approach to develop a Virtual SIMD Machine (VSM) that can automatically parallelise large data structures on heterogeneous architectures. The novelty in this approach is using a VSM model to completely hide the underlying details of heterogeneous architectures [154]. This approach is based on new parallelisation techniques that imitate a SIMD instruction set using virtual (large) registers and Virtual SIMD Instructions (VSIs). The VSM model is designed to look at machine accelerators as if they are arithmetic units, and when a single virtual instruction is initiated, the actual operation is performed by the accelerators on adjacent parts of the data elements in the virtual registers in parallel.

5.1 Introduction

The VSM model is a register-based model that provides easy access to the target accelerators. The key aspects of the VSM design are its internal (hidden) structure and its interface to users. Its internal structure, which is completely hidden from the users, is primarily built of two co-operative interpreters: a master (or host) interpreter and an accelerator (or slave) interpreter.

The master interpreter schedules micro-tasks and sends messages to the accelerators requesting them to perform an array operation, such as load, store, add, etc., in parallel. The master interpreter is a collection of stub routines that run on the host processor, and these routines are responsible for creating and launching threads, data partitioning, communication and coordinating with accelerators for any required synchronization. The stub routines that are responsible for processing data can be seen as virtual SIMD instructions. Most of the VSM instructions were designed as non-blocking operations, which means that the master processor can resume executing the next instructions once the sent messages have been delivered to the accelerators. This means that the master processor does not have to wait for the accelerator devices to finish the requested operations in order to execute the next instruction. The VSM instruction set also includes a few blocking instructions. The

blocking instructions, such as Store and Reduction, are usually required to send data back from the accelerators to the master's memory space.

The other (slave) interpreter, on the other hand, is a program that is designed to run constantly on each accelerator in the background. This program is responsible for extracting information from messages sent by the master, any data alignment, acknowledgments and synchronization. An accelerator's program frequently checks whether any message has been dispatched by the master, and when it receives one, it decomposes the message to determine the required operation and then performs it. The VSM concept fits well with machines that have multiple levels of memory hierarchy such as the Cell BE processor.
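As an illustration only, the fragment below sketches the kind of dispatch loop such an accelerator program might run, anticipating the message formats described in Section 5.4; the opcode values follow Table 5.1, but the helper functions are hypothetical names rather than the actual VSM routines.

#include <spu_mfcio.h>

enum { OP_LOAD = 0x01, OP_STORE = 0x02, OP_ADD_I32 = 0x10 };

/* hypothetical helpers standing in for the real VSM SPE routines */
extern void do_load(unsigned vreg, unsigned ppe_addr);
extern void do_add_i32(unsigned dst, unsigned src);

void spe_interpreter_loop(void) {
    for (;;) {
        unsigned msg = spu_read_in_mbox();      /* blocks until the PPE sends a message */
        unsigned op  = (msg >> 24) & 0xFF;      /* opcode:        bits 0-7   */
        unsigned src = (msg >>  8) & 0xFF;      /* source reg:    bits 16-23 */
        unsigned dst =  msg        & 0xFF;      /* dest register: bits 24-31 */
        switch (op) {
        case OP_LOAD: {
            unsigned addr = spu_read_in_mbox(); /* second word: effective address */
            do_load(dst, addr);                 /* DMA the block into local store */
            break;
        }
        case OP_ADD_I32:
            do_add_i32(dst, src);               /* element-wise add on local data */
            break;
        /* ... further opcodes ... */
        }
    }
}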

The second important design aspect is that VSM is a parallel programming interface that can be used by two different classes of users: compiler developers and regular programmers. VSM is a C/C++ based specification that is mainly designed to assist compilers in parallelising array expressions on target accelerators without having to deal with the underlying details of the target machine, yet it can be easily used as an Application Programming Interface (API) to operate low-level communications with the machine. For example, a programmer can decompose a high-level array expression into sequences of separate operations and then call the proper VSM stub routines to parallelise one array operation at a time.
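A hedged sketch of what such a decomposition might look like for Z := X + Y follows; the stub names and signatures are assumptions made for illustration, not the exact VSM interface.

/* Assumed prototypes, for illustration only; not the exact VSM interface. */
extern void vsm_load (unsigned vreg, unsigned addr);   /* VR[vreg] <- memory[addr..]            */
extern void vsm_store(unsigned vreg, unsigned addr);   /* memory[addr..] <- VR[vreg] (blocking) */
extern void vsm_add_f32(unsigned dst, unsigned src);   /* VR[dst] <- VR[dst] + VR[src] on SPEs  */

void vec_add_block(float *X, float *Y, float *Z) {
    /* addresses are assumed to fit in 32 bits, as in the thesis design */
    vsm_load(0, (unsigned)(unsigned long)X);    /* load one register-sized block of X into VR0 */
    vsm_load(1, (unsigned)(unsigned long)Y);    /* load the matching block of Y into VR1       */
    vsm_add_f32(0, 1);                          /* element-wise add, performed on the SPEs     */
    vsm_store(0, (unsigned)(unsigned long)Z);   /* write the result block back to Z            */
}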

VSM was implemented in C/C++ to take advantage of the existing tools for programming the Cell processor, such as the GNU compilers and libraries, and of any future improvement of those tools. The VSM implementation employed two algorithms to handle alignment on virtual load and store instructions. The main ambitions in developing VSM this way are threefold. The first objective is to provide easy access to the Cell hardware resources by hiding all the underlying details of the machine. The second is to use the VSM interface as an abstract model to shorten the time for developing parallelising compilers for heterogeneous systems. The third objective is to ease parallel program development by concentrating on algorithms rather than on parallelisation issues such as identifying available parallelism, communication, data partitioning, alignment and synchronization.

This chapter includes a description of the VSM interface design and implementation and the challenges encountered during the development. It starts with a description of basic elements of the VSM and the messaging protocol. After that it discusses the VSM’s two co-operative interpreters and the challenges encountered during the development of the VSM and concludes with the experimental results. It should be noted here that some of the material in this chapter is already published in [154].


Figure 5.1: Splitting a VSM Register on 4 SPEs (a packed virtual register of 16384 bytes divided into four 4096-byte blocks, one on each of SPE0 to SPE3)

5.2 Virtual SIMD Registers

5.2.1 VSM Register File

A virtual register represents a set of fields (vector data) that are logically implemented in consecutive local memory locations. The VSM design looks at data in a single virtual register as packed data which could be allocated on one SPE or scattered over multiple SPEs; see Figure 5.1. To clarify this point, the term "VSM register" shall, henceforth, be used to refer to the registers of the VSM model, and the term "SPE virtual register" shall refer to the consecutive memory locations on an SPE. Accordingly, the size of the VSM registers should indicate the adequate array size that can be evaluated on one SPE or parallelised on multiple SPEs. Hence, if a VSM register size is PSize bytes, then the SPE virtual register size S is

S = PSize / P bytes    (5.1)

where P is the number of SPEs used.

Generally, the bigger an SPE's virtual register size gets, the better for performing typical computation operations, such as add and multiply, on the SPEs, as this allows exploiting the computing power of the Cell processor. It also minimises boundary checking on arrays, conditional branch instructions and the number of data transfers. Yet, due to machine alignment constraints and VSM design and performance matters, the decision on the proper virtual register size became a very important design issue, in particular for memory access instructions.


5.2.2 Size of VSM Registers

The size of an SPE virtual register is crucial in load and store operations. In the VSM implementation, these two operations depend on the Cell's DMA mechanism to move data between the Cell's main memory and the SPEs' local memories. Recall here that the Cell processor places constraints on the size of a DMA transfer: a Cell DMA data transfer can range from one byte up to a stream of 16 KBytes [91]. Since VSM attempts to use large registers, the first constraint is imposed on the maximum size of data that can be transferred by a single load or store operation. The maximum data that can be moved by one SPE at a time must not exceed 16384 bytes. This implies that the maximum size of the VSM registers should be 16384 ∗ P bytes, where P is the number of SPEs used. However, a virtual register size should be chosen to be large enough to pay off the parallel overhead while adhering to the machine's DMA maximum size.

However, using small SPE virtual registers does not amortise communication and data movement overheads, because these overheads are expected to dominate the time spent on processing small data blocks (registers). On the other hand, large SPE virtual registers require either multiple small DMA transfers, which eventually introduce additional DMA set-up times, or moving a big chunk of data in a single DMA transfer, which may leave an SPE's computation units in an idle state until the data is completely transferred. Large register sizes can also limit parallelism because only arrays that have the same sizes as the virtual registers can then be parallelised.

During the development process we tested the VSM model on the Cell processor using various VSM virtual register sizes, in particular 1024, 2048, 4096, 8192 and 16384 bytes, which are divided over the SPEs used. For example, if the VSM register size is 4KB, then one SPE can process 4KB; if 2 SPEs are used, then each can process 2KB, and on 4 SPEs each can process 1KB. Taking these considerations into account, we conducted several experiments to determine the appropriate VSM register size. Based on the current VSM implementation we found that the proper VSM register size is 4096*P bytes, where P is the number of SPEs, for 32-bit integer and 32-bit single-precision floating-point data types. More analysis on the optimal VSM register size shall be given when we discuss the results and the performance of the compiler on the N-body benchmark.

The current VSM implementation also imposes divisibility and alignment restrictions on the size of the SPE virtual registers. The divisibility issue appears when multiple SPEs are used, as the number of data elements in a VSM register must be divisible by the number of SPEs. The other restriction has to do with data alignment to get good performance. As was mentioned before, on the Cell processor DMA transfers must be aligned on a multiple of a 16-byte boundary, but for better performance they should be aligned on a 128-byte

boundary [91]. For this reason, the SPE virtual registers are all aligned on 128-byte boundaries, yet the VSM is designed to handle moving unaligned data between the PPE and SPEs. This topic shall be discussed in detail when talking about how the SPEs handle load and store operations.
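For illustration only, two common C idioms for obtaining 128-byte-aligned buffers are sketched below; the thesis does not state how the VSM implementation allocates its registers, so the names and the per-SPE size are assumptions.

#include <stdlib.h>

#define SPE_VREG_BYTES 4096   /* assumed per-SPE virtual register size */

/* static buffer with a GCC-style alignment attribute */
static float spe_vreg0[SPE_VREG_BYTES / sizeof(float)]
        __attribute__((aligned(128)));

/* dynamically allocated, 128-byte-aligned buffer */
float *alloc_vreg(void) {
    void *p = NULL;
    if (posix_memalign(&p, 128, SPE_VREG_BYTES) != 0)
        return NULL;
    return (float *)p;
}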

The degree of parallelism on standard SIMD machines, such as the SPEs, is determined by the number of elements of a given data type that fit in a single physical machine register, and therefore the degree of parallelism that our proposed VSM offers is amplified by a factor of P, where P again is the number of SPEs used.

5.3 Virtual SIMD Instructions

The VSM interface was designed as two layers that separate hardware from software, and it provides access to the hardware resources and services available in a system through a Virtual SIMD Instruction (VSI) set. The VSIs are designed as high-level programming functions, and they include those aspects that are visible to users such as computation and memory access routines. The VSIs also have to collaborate with other hidden routines, such as communication and data partitioning routines, to dispatch tasks on the SPEs.

In this section, we limit our discussion to the design and implementation of the routines that are visible to the outside world. The hidden aspects or routines will be discussed in detail when we talk about the PPE and SPE interpreters.

5.3.1 Virtual Instruction Format

The VSIs are a set of RISC like register load, operate and store operations, and they are implemented as C++ functions. From now on, the word “function” will be also used to refer to a VSI. Each VSI or function consists of an opcode, one or more operands such as registers or memory locations and the effects or outcome of the instruction.

1. The opcodes determine the operations to be executed. Each instruction for each primitive data type, such as integer, float, double and characters, has a unique op- code. Table 5.1 shows some of the opcodes of primitive operations.

2. The VSIs are register-to-register instructions. An operand can be a number or a data memory location, and both must be unsigned integer values. The number would typically refer to a virtual register or a physical machine (general-purpose) register, which is used to carry out a given operation.


operation   int32   int16   int8    ieee32   ieee64
load        0x01
store       0x02
add         0x10    0x20    0x30    0x40     0x50
subtract    0x11    0x21    0x31    0x41     0x51
divide      0x12    0x22    0x32    0x42     0x52
multiply    0x13    0x23    0x33    0x43     0x53

Table 5.1: VSM Opcodes

The memory location refers to a starting address for the load and store operations. This information has to be supplied as parameters to the corresponding function.

3. Most VSIs, which are in fact part of the PPE interpreter, have similar outcomes or tasks. These effects are:

a) Calculate the starting address of the data to be loaded to or stored from each SPE. The data is presumably equally distributed on the SPEs.

b) Combine the passed parameters, the unique operation code and, if needed, the computed starting address into message(s). This combined information determines the resources to be used in that operation.

c) Send formulated messages to each used SPE’s inbound mailbox.

d) Wait for a completion acknowledgment from the SPEs (if needed).

e) Synchronize between different cores (if needed).

f) Return any expected results.

These effects, however, are completely hidden from users.

A VSI can be easily invoked by just calling the analogous function, provided that the required information is supplied as parameters. In fact, this is the only thing a user needs to do in order to execute a VSI, which then takes all the required steps to carry out the requested operation, such as load, store or add, on the Cell's SPEs.

5.3.2 Virtual Instruction Types

The VSIs set can be divided into three groups:


1. Data Movement

Data is moved between the PPE main memory and the SPEs' local memories using DMA data transfers. The Load and Store instructions each require two register operands: a virtual register and a physical PPE machine register. The virtual register is used for the load or store operation, while the physical machine register must contain the address which indicates the starting location in the PPE main memory of the data to be loaded or stored.

2. Binary Computational Instructions

Most computational instructions require two operands to determine the virtual registers involved in the operation. There are, however, a few operations, such as reduction and replicate, which require a different type of register. For example, the replicate instruction takes a scalar and makes copies of it into a virtual register, and therefore this operation also requires two register operands: one a virtual register, but the second one must correspond to a physical PPE machine register to hold the scalar value.

3. Unary Computational Instructions

This type of instruction, such as sqrt, sin and cos, carries out operations which usually require only one unsigned integer to indicate a virtual register number. The virtual register is used as both the source and the destination register.

So far, we have looked at the main design issues in the virtual SIMD instructions. We defer discussion of their implementation until we have explained the hidden aspects connected to the VSIs, such as the messaging protocol.

5.4 VSM Messaging Protocol

The messaging protocol facilitates the communication between the PPE and SPE interpreters using forward messages and return messages. These messages are sent through the Cell’s mailboxes.

5.4.1 Forward Messages

Forward messages are mainly used by the VSIs, which are functions run on the PPE, to order the SPEs to execute an operation. These messages contain the information needed to perform an operation, such as an opcode, virtual register numbers and, for Load or Store, a main memory address. The forward messages are designed in two formats: a two-word (2 x 32-bit) format and a one-word (32-bit) format. The two-word format, shown in Figure 5.2 (a), is used in load and store operations. The first 32-bit word contains an opcode and a register number: the first 8 bits, i.e. bits 0-7, are reserved for the opcode, and the last 8 bits, i.e. bits 24-31, are reserved for the virtual register number. The second 32-bit word holds a memory effective address which determines the starting location of the data in the PPE memory.

   bits:   0-7         8-23           24-31
          Opcode      (unused)       Register No
          Effective Address (second 32-bit word)

   (a) Load and Store Message Structure

   bits:   0-7         8-15       16-23          24-31
          Opcode      (unused)   Source Reg.    Destination Reg.

   (b) Computational Operations Message Structure

Figure 5.2: Message Formats

The one-word format, which is shown in Figure 5.2 (b), is used for the other operations, excluding load and store. This 32-bit word holds three values: an opcode, a source register and a destination register. The opcode must be placed in the first 8 bits, i.e. bits 0-7, the source register in bits 16-23, and the destination register number in the last 8 bits, i.e. bits 24-31.

Accordingly, the forward message design allows the encoding of up to 256 operations, the use of 256 source registers and the same number of destination registers, and the addressing of up to 4 GB of RAM. The implementation shall be discussed in the next section.
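To make the encoding concrete, the sketch below packs and unpacks a one-word message according to the layout above. The helper names packMsg and unpackMsg are illustrative only, not part of the VSM source; bit 0 is the most significant bit, following the PowerPC convention used in Figure 5.2.

    typedef unsigned int uint;

    inline uint packMsg(uint op, uint src, uint dst) {
        return (op << 24) | ((src & 0xFF) << 8) | (dst & 0xFF);
    }

    inline void unpackMsg(uint msg, uint &op, uint &src, uint &dst) {
        op  = msg >> 24;           // bits 0-7  : opcode
        src = (msg >> 8) & 0xFF;   // bits 16-23: source register
        dst = msg & 0xFF;          // bits 24-31: destination register
    }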

5.4.2 Return Messages

The return messages (or acknowledgments) are sent from the SPEs to the PPE. The SPEs use their outbound mailboxes to notify the PPE of operation completion. Return messages are used in operations that are required to write data back to the PPE main memory, such as the store and reduction operations.

5.5 PPE Interpreter

The PPE interpreter is a collection of routines that run on the master processor. These routines are implemented in C++ and can be compiled using third-party C compilers for the PPE, such as the GNU compilers. The resulting object files, along with the SPE object file, can then be linked to a user program or to the output of a compiler. We shall discuss here the structure of the interpreter, the main design aspects and the implementation of the main PPE routines.

5.5.1 PPE Interpreter Structure

The PPE interpreter can be divided into two subsets of library functions or routines: a set of VSM-level library routines responsible for creating and launching SPE threads, data partitioning and communication, and a set of user-level library routines (the VSIs) for scheduling memory and computational operations to execute in parallel on the SPEs.

5.5.2 VSM-Level Runtime Library

The VSM-level library functions are completely hidden from the user. The main VSM-level routines handle message passing, SPE thread management, synchronisation and data partitioning. Most of these routines are implemented as inline functions to save the overheads of function calls.

• Message Passing Routine

Figure 5.3 shows the PPE function that handles two-word format messages, which are sent by user-level routines (or VSIs) to perform load or store operations. The function takes three arguments: a memory effective address (MEM_EA), a 32-bit word (msg2) and the receiver number (ID). The second argument should contain the opcode and the virtual register number. The function also uses a globally defined p_s_area[] array that contains pointers to the SPE_CONTROL_AREA structures of all the SPEs after they have been mapped into the PPE address space. For more detail on the control area structure’s contents see section 3.2.7.

Because each SPE’s inbound mailbox can hold only four entries at any given time, the first thing this function does is to check the number of entries available in the targeted inbound mailbox by checking, within a do/while loop, the mailbox’s status register from the p_s_area[ID] control area using the SPU_MBOX_STAT flag.

inline void broadCast2Msg(uint mem, uint msg2, uint ID)
{
    uint status, Count;
    do {
        status = *((volatile uint *)(p_s_area[ID] + SPU_MBOX_STAT));
        Count  = (status & 0x0000FF00) >> 8;
    } while (Count < 2);
    // Assign the two messages to the MMIO registers
    *((volatile uint *)(p_s_area[ID] + SPU_IN_MBOX)) = mem;
    *((volatile uint *)(p_s_area[ID] + SPU_IN_MBOX)) = msg2;
}

Figure 5.3: Broadcasting PPE Messages to SPEs

The do/while loop keeps iterating until the Count variable indicates that there are at least two entries available in the inbound mailbox. I had to use the do/while loop because the PPE instructions used to access the mailboxes are non-blocking, which means that the PPE will not stall when writing into an SPE’s mailbox even if the inbound mailbox is full; instead it overwrites the last message. Once the function sees that the mailbox has two free entries, it sends the mem and msg2 messages to the ID’th SPE’s inbound mailbox, again using the p_s_area[ID] control area but this time with the SPU_IN_MBOX flag.

The implementation of a function that sends a one-word message format is very similar to the code given in Figure 5.3; the only difference is the number of messages to be forwarded.
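For completeness, such a one-word sender might look as follows; the name broadCast1Msg and the single-entry check are assumed here rather than taken from the VSM source:

    inline void broadCast1Msg(uint msg, uint ID)
    {
        uint status, Count;
        do {   // wait until at least one inbound mailbox entry is free
            status = *((volatile uint *)(p_s_area[ID] + SPU_MBOX_STAT));
            Count  = (status & 0x0000FF00) >> 8;
        } while (Count < 1);
        *((volatile uint *)(p_s_area[ID] + SPU_IN_MBOX)) = msg;
    }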

• Managing SPE Threads Routine

This function’s role is to get the SPEs ready to accept tasks from the PPE. It is called only once and must be called before attempting to use the SPEs. Figure 5.4 shows the main code of the Threads-Managing function. To shorten the displayed code, we excluded all comments, warnings and error messages.

The function takes one argument, which determines the number of SPEs to be used, and performs four tasks. It first reads how many SPEs are available on the machine and checks that there are enough of them. The second task is to create one context (thread) per SPE and keep the pointers to the created SPE contexts in the array T_arg[]. The third task is to load the SPE interpreter (program) into each SPE’s local store. The last task is to order each SPE to start running its thread.

In the last step, POSIX threads (pthreads) were used to run the SPE contexts because the call to the function spe_context_run() is a thread-blocking call. That is, if any thread launches an SPE’s thread using the spe_context_run() function, the launching thread is blocked until the launched thread finishes or is terminated. The main PPE thread must therefore avoid calling spe_context_run() directly, because VSM was designed to run both the PPE interpreter and the SPE interpreters at the same time. To solve this problem, VSM uses a separate pthread to call spe_context_run() for each SPE, allowing the main PPE thread and the SPE threads to run simultaneously and to communicate with each other. The last loop in Figure 5.4 shows how separate pthreads are used to invoke the spe_thread_run function shown in Figure 5.5. See section 3.2.8 for more details.

void *Threads-Managing(uint noSPE)
{
    if (spe_cpu_info_get(SPE_COUNT_USABLE_SPES, -1) < noSPE) exit(1);

    for (int i = 0; i < noSPE; i++) {
        // Create one SPE context (thread) per SPE
        if ((T_arg[i].speContext = spe_context_create(0, NULL)) == NULL) exit(1);

        // Load the SPE interpreter into the local store
        if (spe_program_load(T_arg[i].speContext, &speInterpreter) != 0) exit(1);
        T_arg[i].speID = i;
    }

    // Start running the SPE threads
    pthread_t tid;
    for (int i = 0; i < noSPE; i++)
        pthread_create(&tid, NULL, (void *(*)(void *)) spe_thread_run, &T_arg[i]);

    return NULL;
}

Figure 5.4: SPE Thread Creation

void *spe_thread_run(thread_arg *T)
{
    uint entry = SPE_DEFAULT_ENTRY;
    if (spe_context_run(T->speContext, &entry, 0, 0, 0, 0) < 0) exit(1);
    return NULL;
}

Figure 5.5: Launching an SPE Thread Using POSIX Threads

• Operations Synchronisation

Most of the VSIs do not require barrier synchronisation because they are implemented as non-blocking operations: the PPE is allowed to proceed, dispatching these operations one after another. However, this is not the case with the store operation, which is implemented as a blocking operation. When the PPE orders the SPEs to store data back to the main memory, the store routine halts the PPE from performing any other operation until the store operation has completed.

void OpSynch(uint noSPE)
{
    uint status, Count;
    for (int i = 0; i < noSPE; i++) {
        do {   // wait until the i-th SPE posts its acknowledgment
            status = *((volatile uint *)(p_s_area[i] + SPU_MBOX_STAT));
            Count  = status & 0x000000FF;
        } while (Count == 0);
        // pull the posted message to empty the outbound mailbox
        uint ss = *((volatile uint *)(p_s_area[i] + SPU_OUT_MBOX));
    }
}

Figure 5.6: Barrier Synchronisation Routine

To implement barrier synchronisation, the PPE and SPEs have to collaborate via messages to ensure that all the SPEs in use have completed storing their data back to the main memory. Figure 5.6 shows the function which handles this issue on the PPE side. This function expects that each SPE will send an acknowledgment once it has finished the store operation. From the PPE’s point of view, the function checks the outbound mailboxes of the SPEs one after another. When it finds a message in the current SPE’s outbound mailbox, it pulls the posted message to empty the mailbox and continues checking the outbound mailboxes of the other SPEs. To maintain data consistency, this routine must be called by the PPE immediately after it orders the SPEs to store data back into the PPE main memory.

We also implemented this routine in a different way, trying to minimise the time spent waiting for each SPE to finish the store operation. The alternative implementation reserves a flag for each SPE and sets each flag to false. It then starts checking, in sequence, the outbound mailboxes of the SPEs whose flags are false. If the PPE finds a message in the current SPE’s outbound mailbox, it sets that SPE’s flag to true, exempting it from being checked again, and goes to the next. But if no message has been posted in the current SPE’s outbound mailbox at that point, it skips that SPE and checks the outbound mailbox of the next one. Once it has gone over all the outbound mailboxes, it repeats the same procedure, but only on the SPEs whose flags are still false.
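A minimal sketch of this alternative scheme is given below; the routine name, the flag array, the MAX_SPES constant and the assumption that the outbound-mailbox count sits in the low byte of the status word are all made for illustration only:

    void OpSynchPolling(uint noSPE)
    {
        bool done[MAX_SPES] = { false };   // MAX_SPES is assumed here
        uint remaining = noSPE;
        while (remaining > 0) {
            for (uint i = 0; i < noSPE; i++) {
                if (done[i]) continue;
                uint status = *((volatile uint *)(p_s_area[i] + SPU_MBOX_STAT));
                if ((status & 0x000000FF) == 0) continue;        // nothing posted yet, skip this SPE
                (void)*((volatile uint *)(p_s_area[i] + SPU_OUT_MBOX)); // drain the acknowledgment
                done[i] = true;
                remaining--;
            }
        }
    }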


5.5.3 User-Level Library Routine

The VSIs are a set of C++ functions that are available to the outside world and can easily be used to evaluate array operations on the SPEs. Using the VSIs is a one-step process: users, such as programmers or compilers, can use the VSI set by calling the corresponding function, provided that the required information, such as the virtual register number(s) and the starting addresses of the arrays, is supplied.

As mentioned in the previous section, the functions corresponding to the VSIs generally have similar duties, such as formulating messages, partitioning data, sending messages and waiting for acknowledgments if needed. For this reason, we do not attempt to discuss every instruction individually; instead, we discuss the implementation of the VSI Store and Load operations, as they include almost all the steps that other instructions may need.

The implementation of the function that corresponds to the VSI Store operation is shown in Figure 5.7. This function takes two arguments, a virtual register and a memory location, and its task is to order the SPEs to store the contents of their virtual registers back into the PPE memory. Lines 1 and 2 in Figure 5.7 define the speStoreVec() function as an external C function. The effects of this function are:

1. Formulating Forward Messages:

This function uses the two-word message format to send the opcode, the virtual register and the starting memory location. Because this information is the same for all the SPEs, the message is assembled outside the for loop, in Line 3. The statement in Line 3 uses shift operators to place the operation code and the register number in the proper bits, as defined by the messaging protocol.

2. Line 4 includes a for loop that does the following for each SPE:

a) Calculates the starting address

The VSM model takes a simple approach to partitioning data across the SPEs. Data is chopped into P equal blocks, where P is the number of SPEs. This approach may have some drawbacks due to divisibility issues. The data is partitioned into blocks of the same size in Line 5 by calculating the starting address from which each SPE should work. This address forms the second message of the store operation.

b) Sends messages


1: extern "C"
2: void speStoreVec(uint reg, uint mem_EA) {
3:   msg=(STORE<<24)+((reg<<24)>>24);        // Formulate a Message
4:   for(int i=0; i<noSPE; i++) {
5:     mem = mem_EA + i*VR_S;                // Starting address of the i-th SPE's block
6:     broadCast2Msg(mem, msg, i);           // Send the two messages to the i-th SPE
7:   }
8:   OpSynch(noSPE);                         // Block until every SPE has completed its store
9: }

Figure 5.7: Store Virtual SIMD Instruction

Once the forward messages are formulated, the PPE, from within the VSI, sends these messages to each SPE’s inbound mailbox. Line 6 calls the broadCast2Msg() function to send the two messages to the i-th SPE.

c) Repeat steps (a) and (b) until all the SPEs are informed.

3. Process synchronisation is required in the store operation and a few other operations to ensure that each SPE has completed its part. For example, the store operation, as shown in Line 8, calls the OpSynch() function to synchronise the SPEs and to block the PPE from issuing any other operation before the synchronisation is achieved.

Consider also the Load operation. The corresponding function is shown in Figure 5.8. It also takes two unsigned integer values: a virtual register and a memory address. Line 3 simply concatenates the “LOAD” opcode with the reg register number, and in Line 4 there is a for loop which goes over each SPE and does the following: it calculates in Line 5 the starting address from which each SPE should start loading data into the reg register, and then it calls in Line 6 the broadCast2Msg function to send the two messages to each SPE. Notice here that once the PPE delivers the messages it can proceed to execute other operations, and that is why the Load and most computational operations are called non-blocking operations.

5.6 SPE Interpreter

The SPE interpreter is a C++ program that runs in parallel on several SPEs to execute micro-tasks. The program runs constantly in a loop on each SPE in the background and frequently checks whether any message has been dispatched by the PPE.


1. extern "C"
2. void LoadVec(uint reg, uint mem) {
3.   msgs[0]=(LOAD<<24)+((reg<<24)>>24);     // Formulate the opcode/register message
4.   for(int i=0; i<noSPE; i++) {
5.     msgs[1] = mem + i*VR_S;               // Starting address of the i-th SPE's block
6.     broadCast2Msg(msgs[1], msgs[0], i);   // Send the two messages to the i-th SPE
7.   }
8. }

Figure 5.8: Load Virtual SIMD Instruction

5.6.1 SPE Interpreter Structure

Figure 5.9 outlines the main steps of the SPE program. It is basically constructed around a large switch statement which contains a case for each opcode (operation). The first two jobs in the algorithm are to get any messages in the SPE’s inbound mailbox and then decompose the received messages to identify the requested operation (opcode) and its operands. The main job of the SPE interpreter is then handled by the switch statement, which chooses the corresponding operation accordingly. The computational operations are straightforward routines, but the memory-access operations, load and store, are complicated and challenging due to data alignment and synchronisation requirements.

In the following discussion, we first look at the implementation of the first two steps. We then discuss the main design challenges, techniques and algorithms used in the load and store operations. After that, we present the implementation of the Load and Store operations and two examples of computational operations.

5.6.2 Extracting Information

The first two steps in the SPE program are very important because they affect the execution of both the PPE and the SPEs. The implementation of these two steps is shown in Figure 5.10. In the first statement, the spu_readch() function with the SPU_RdInMbox flag reads the inbound mailbox channel. This is a blocking operation, which means that the SPE stalls when the queue is empty until it receives a message from the PPE. Upon receiving a message, the interpreter extracts the opcode and the register, and then checks whether the extracted opcode is store or load; if it is, it reads the second message, which holds an effective address in the main memory.


while (1) {
  1. Pull the messages            // blocking mode
  2. Extract information          // opcode, registers and MEM_EA
  3. switch (opcode) {
       a) case LOAD:
            i.  Alignment
            ii. Load Virtual Register
       b) case STORE:
            i.   Alignment
            ii.  Store Virtual Register
            iii. Data Synchronisation
            iv.  Acknowledgement
       c) case ADD:
            i. Execute Operation
       d) ......
       e) ......
       f) case TERMINATE:
     }
}

Figure 5.9: SPE Interpreter Structure


1. msg=spu_readch(SPU_RdInMbox);
2. opcode=(msg>>24);
3. Reg1=((msg<<24)>>24);
4. if ( opcode < 8 || opcode==69)
5.    mem_ea=spu_readch(SPU_RdInMbox);
6. else
7.    Reg2=((msg<<16)>>24);

Figure 5.10: Message Pulling Code Segment

5.6.3 Load Operation

The VSM Load operation brings data from the PPE main memory into the SPEs’ local memories using DMAs. It is a non-blocking operation which assumes that the compiler will not get the PPE to write into the area being loaded. The non-blocking mode allows the PPE to continue its work on the current expression after it delivers the two messages of the load operation to the SPEs. The PPE also does not need to wait until the data has been loaded into the SPEs’ local memories, because each SPE can check the completion status of any DMA involved in loading a virtual register. The only problem with the Load operation is the DMA alignment constraint on the starting memory locations. The addresses on both the PPE and SPE sides must be aligned to the same boundary. Thus, having all the SPEs’ local buffers (vector registers) aligned to 128 bytes, as in our VSM, is not enough, because there is no guarantee that the data transferred from the main memory is aligned. There is, in fact, a high probability that the data to be transferred is unaligned, especially when dealing with sub-ranges of an array or with 2D arrays. For this reason, we developed an algorithm for loading aligned or misaligned data into the SPEs’ aligned registers.

To illustrate the algorithm let us define:

• MEM_E and MEM_A as Effective and Aligned addresses on the main memory respectively.

• LS_E and LS_A to be Effective and Aligned addresses on an SPE’s local storage respectively.


Algorithm 5.1 Alignment Load
1. Get effective address (MEM_E)
2. AB = MOD(MEM_E, CLS)
3. MEM_A = MEM_E - AB
4. If AB = 0
5.    Transfer VR_S bytes starting from MEM_A into LS_E
6. else
7.    Transfer VR_S+CLS bytes starting from MEM_A into TB
8.    Copy VR_S bytes starting from TB+AB to LS_E

• VR_S is the size of a virtual register.

• CLS is the target machine’s Cache Line Size in bytes. On the Cell, CLS=128;

• TB is a temporary buffer on the SPE, and its size is VR_S+CLS,

• T is a temporary buffer of size CLS bytes.

• Also assume that each register name represents its starting address on an SPE’s LS and that all DMA transfers are aligned on the CLS boundary on both sides.

Given the above definitions, which will also be used in the discussion of the store instruction, we present Algorithm 5.1 to load aligned or unaligned data of any data type. The algorithm works as follows. If the data to be transferred is aligned, the algorithm skips to Line 5, which loads VR_S bytes of data starting from MEM_E into the SPE local memory (LS_E). However, if the data (MEM_E) is not aligned, Line 2 calculates the offset and Line 3 sets MEM_A to point to the first aligned location before the effective address. In Line 7, it grabs VR_S+CLS bytes of data starting from the calculated aligned location (MEM_A) and copies these bytes into the TB temporary buffer. Transferring an additional CLS bytes ensures that the first CLS bytes will include MEM_E and that MEM_E+VR_S will be within the last CLS bytes of the TB buffer. Now, to get rid of these additional bytes, Line 8 copies only VR_S bytes from the TB temporary buffer, starting from address TB+AB.
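A minimal SPE-side sketch of Algorithm 5.1 is shown below. It assumes CLS = 128, an aligned temporary buffer TB of VR_S+CLS bytes, an aligned register buffer LS_E, a DMA tag TAG and the usual spu_mfcio.h intrinsics; the function name and the explicit tag-status wait are illustrative rather than taken from the VSM source.

    void vsmLoadBlock(unsigned int MEM_E, char *LS_E, unsigned int VR_S)
    {
        unsigned int AB    = MEM_E % CLS;     // offset from the previous CLS boundary
        unsigned int MEM_A = MEM_E - AB;      // first aligned address at or before MEM_E
        if (AB == 0) {
            // aligned case: one DMA straight into the virtual register
            mfc_get(LS_E, MEM_A, VR_S, TAG, 0, 0);
        } else {
            // unaligned case: over-fetch one extra cache line into the temporary buffer
            mfc_get(TB, MEM_A, VR_S + CLS, TAG, 0, 0);
            mfc_write_tag_mask(1 << TAG);
            mfc_read_tag_status_all();        // wait for the DMA before copying
            memmove(LS_E, TB + AB, VR_S);     // keep only the VR_S bytes starting at MEM_E
        }
    }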

Figure 5.11 shows an example of loading 1024 unaligned bytes from the PPE main memory into an SPE virtual register (local memory). Recall that the Cell’s cache line size CLS is 128 bytes.


[Figure: the unaligned 1024-byte range starting at MEM_E = 2000 is covered by a single aligned DMA transfer of 1152 bytes starting at MEM_A = 1920 (a 128-byte boundary) into the SPE local memory, after which 1024 bytes are copied into the aligned virtual register.]

Figure 5.11: Loading Unaligned Data to SPE’s Local Memory

5.6.4 Store Operation

The Store operation is even more complicated than the Load operation, for two reasons. First, this operation transfers data back from the SPEs to the PPE main memory using DMAs, and therefore, in keeping with the alignment constraints on DMA transfers, some parts of the data are expected to be shared among the SPEs, or between one of the SPEs and the PPE. The second reason is that the store operation is a blocking operation, and thus the SPEs need to collaborate with the PPE for barrier synchronisation.

In the Cell, each SPE includes a globally coherent DMA engine [136, 91]. DMA transfers are coherent operations, which keeps data sharing between the Cell’s cores consistent [155]. When transferring data from the main memory to an SPE local memory, the MFC will snoop the data from the PPE’s cache if the cache contains the most recent copy. Similarly, when transferring data from an SPE local memory to the main memory, the cache lines associated with the transferred data blocks are invalidated, and therefore any future access to these locations gets updated data.

However, our VSM depends upon DMA transfers that do not act as a globally coherent cache in the standard sense, and therefore it must perform coherence maintenance to ensure data consistency when it attempts to write unaligned data back to the main memory. Writing back unaligned data using DMA transfers from the SPEs requires reading and merging bytes from the main memory before rewriting those bytes as part of a DMA. This problem arises when multiple SPEs are used, and even when only one SPE is used. In the VSM design, every virtual register is chopped into blocks to be distributed over the available SPEs, and these blocks are not necessarily aligned. This implies that the SPEs will write adjacent unaligned blocks of data into main memory, or into locations near each other, and therefore attempting to update these locations concurrently using, for example, aligned 128-byte DMA transfers may result in collisions between the SPEs. This situation could also arise between one SPE and the PPE. For example, consider an SPE that attempts to cache a block of data using a single DMA transfer. If the data is not aligned, the SPE has to read some bytes from the main memory to adhere to the DMA alignment constraints. Yet this requires that the additional bytes which have just been read remain unchanged when written back into main memory as part of the issued DMA transfer(s), so that only the data actually processed by the SPE is overwritten.

The alignment problem in the store operation is critical, as it may result in race conditions and data inconsistency. To solve these issues, I developed an algorithm for storing unaligned data from one or multiple cores. The main concept of the algorithm is to use lock-based synchronisation. This type of synchronisation first ensures consistency of data that might be shared due to alignment constraints. Secondly, it avoids any race conditions that might occur, and lastly it minimises the portion of data that is unavailable to other processes.

5.6.4.1 Store Processing Algorithm

The algorithm, shown in Algorithm 5.2, is based on an operation-dividing technique in which each SPE’s data block is divided into three parts (or data sub-blocks): Head, Middle and Tail. According to our proposed parallelisation technique, only the two ends, Head and Tail, would be shared between the Cell’s cores; separating out the middle part therefore allows a DMA transfer to be used for each part, makes it easy to work on the unshared part and enables concurrent data transfers.

The key design elements in the algorithm are:

• Use a separate DMA transfer for each part: Head, Middle and Tail.

• The Head and Tail lengths depend on the machine resources and limitations, but they should be kept as small as possible.

• Data will be shared between a maximum of two cores: the PPE and one of the SPEs or between two SPEs.

• Let the size of each end be equal to CLS.

• Only one SPE at a time can obtain a lock on all CLS bytes of memory, or even on part of them.


[Figure: the virtual register is split into 1024-byte blocks, one per SPE (SPE0 to SPE3). SPE2’s block, for example, is stored as a 128-byte locked Head (overlapping SPE1’s Tail), an 896-byte aligned Middle part and a 128-byte locked Tail (overlapping SPE3’s Head).]

Figure 5.12: Splitting an SPE Block into 3 DMAs for Storing Process

• If multiple SPEs want to obtain a lock on any part of the data, one SPE wins and the others wait.

• For both the Head and the Tail, set a lock on the CLS bytes covering that end, read them, overwrite them with the changes made by the SPE, write them back and then unlock.

• Write back the middle part into the main memory.

Algorithm 5.2 first computes the aligned memory location (MEM_A) that precedes the effective address MEM_E. It then works on the three different parts: Head, Middle and Tail.

Figure 5.12 illustrates how the algorithm is applied on the Cell processor. The size of the Head and Tail was chosen to be 128 bytes for two reasons: first, all DMA transfers are aligned on a 128-byte boundary, and second, the Cell offers atomic DMA operations, such as getllar() and putllc(), to set, reserve and release locks on 128 bytes; for more details see section 3.2.2. Now, since the Head and Tail are 128-byte aligned blocks, the Middle portion should be 128 bytes smaller than the virtual register size, that is, VR_S - 128 bytes.

5.6.4.2 Implementation

The following two figures show the main parts of the implementation of the store algorithm for the Cell processor. The code was split into two parts simply to make it readable and understandable. Figure 5.13 shows the initialisation part and the code for storing the Head of an SPE data block. The Tail would be coded similarly, aside from the starting addresses. Note that in this implementation the Head and the Tail are each 128 bytes long, i.e. CLS = 128.


Algorithm 5.2 Unaligned Store Operation
1.  AB = MOD(MEM_E, CLS)
2.  MEM_A = MEM_E - AB
    // HEAD (partially updated)
3.  Set a lock on CLS bytes starting from MEM_A
4.  Get CLS bytes starting from MEM_A into T
5.  Append the results to the tail of T
6.  PUT back T, and release the lock
    // UPDATE THE MIDDLE PART (completely updated)
7.  MEM_A = MEM_A + CLS
8.  LS_E = LS_E + CLS - AB
9.  MS = VR_S - CLS
10. Copy MS bytes starting from LS_E to LS_A
11. PUT MS bytes back, starting from LS_A, to MEM_A
    // TAIL (partially updated)
12. MEM_A = MEM_A + MS
13. Set a lock on CLS bytes starting from MEM_A
14. Get CLS bytes starting from MEM_A into T
15. Append the results to the head of T
16. PUT back T, and release the lock


// Initialization
uint AB, MEM_A, LS_A;
AB    = MEM_E % 128;
MEM_A = MEM_E - AB;

// Store Head
do {
    mfc_getllar((void *)T, MEM_A, 0, 0);              // Lock & read 128 bytes
    (void)spu_readch(MFC_RdAtomicStat);                // Wait for the getllar (reservation) to complete
    memmove((void *)((uint)T + AB), (void *)LS_A, CLS - AB);   // Merge the SPE's bytes into T
    mfc_putllc((void *)T, MEM_A, 0, 0);                // Store conditionally and release the lock
} while (spu_readch(MFC_RdAtomicStat) & MFC_PUTLLC_STATUS);    // Retry if the conditional store failed

Figure 5.13: Synchronise Shared-Block

Figure 5.14 shows the code to store the Middle part of an SPE data block. This code should be executed after storing the Head portion. The first line shifts the MEM_A pointer to the first aligned location in the PPE main memory that comes after the Head, that is, the starting address at which the Middle part should be stored. The second line calculates the starting location of the data block which resides in the SPE local memory. This location must come 128-AB bytes from the beginning (LS_A) of the block. The third line checks whether the new location of the local data is aligned. If it is not, the Middle part must be copied into an aligned buffer, called here TMP_REG, in the SPE local memory before being moved to main memory. This step is necessary, as the source and destination addresses in any DMA transfer must be aligned to at least a 16-byte boundary. The last step is a standard DMA function, mfc_put(), that transfers data from the SPE local memory to the PPE main memory. Once the two addresses in the function are ensured to be aligned, the standard mfc_put() function can safely be executed without the need for any locks or synchronisation.

So far, we have discussed how the store operation handles data synchronisation, but it also has to collaborate with the PPE implementation to maintain barrier synchronisation, as mentioned above. In the store operation, the SPEs are required to notify the PPE of completion once each SPE has ensured that its entire data block has been stored into the main memory. Figure 5.15 shows the two lines of code that handle the acknowledgment. These lines must be executed by the SPEs at the end of the operation. The spu_readchcnt() function with the SPU_WrOutMbox flag in the first line checks whether the outbound mailbox is empty, and the second line writes a message to the outbound mailbox once it is empty. The message could be any 32-bit word, but we used the number 99. The PPE can validate the completion by checking whether or not it has received the number 99 via the mailbox.


// Transfer the Middle part of the VSM register
MEM_A = MEM_A + 128;
LS_A  = LS_A + 128 - AB;
if (LS_A % 16 != 0) {                                  // Unaligned local address
    memmove((void *)((uint)TMP_REG), (void *)LS_A, SIZE - 128);
    LS_A = TMP_REG;
}
mfc_put((void *)LS_A, MEM_A, SIZE - 128, destReg, 0, 0);

Figure 5.14: Storing Middle Block

do {} while (!spu_readchcnt(SPU_WrOutMbox));
spu_writech(SPU_WrOutMbox, 99);

Figure 5.15: Acknowledgment of an Operation Completion

5.6.4.3 Design Challenges

In the alignment algorithm illustrated in Figure 5.12, the Head and Tail of an SPE’s block are shared between the Cell’s cores. To ensure that these shared parts of the data are overwritten in the proper order, we used locks when updating the Heads and Tails. This synchronisation keeps memory coherent and avoids any race conditions that could arise as different SPEs attempt to update the Heads and Tails. The race condition problem is solved by reserving a lock on each 128-byte block until the granted SPE has updated the data and released the lock.

The algorithm for storing unaligned data was designed with the intention of minimising the cost of the DMA overheads and of the extra memory copies we had to make using the memmove() function. The order of these DMAs, as presented in Algorithm 5.2, provides some overlap within one SPE and among the SPEs. The overlap within one SPE, for example, occurs while storing the Middle part and the Tail. Once an SPE issues a DMA request to its MFC to transfer the Middle part, the MFC takes control of the transfer; meanwhile the SPE continues with processing the Tail by requesting a lock, and so on. Once the Tail has been stored back, the SPE then verifies whether the Middle part has been completely transferred. The DMA order also allows the SPEs to overlap updating the Heads and Tails. For example, when SPE0 is locking its Head, SPE1 can easily be granted a lock on its own Head, even though SPE1’s Head is part of SPE0’s Tail, because by the time SPE0 reaches the point of updating its Tail, SPE1 will most likely have finished updating its Head.

5.6.5 Computational Operation

The VSM is basically designed to use the SPEs to perform computational operations on one-dimensional arrays (vector registers) or scalars, such as in the replicate and reduction operations. Most of these operations, however, have similar implementations. This subsection first describes the implementation aspects that are common to these operations and then looks at three operations whose implementations differ in some respects.

The common aspects in the implementation of these operations are:

• All the operations are carried out on the SPEs using the built-in SPU intrinsic functions. These intrinsic functions, which support operations on all primitive data types, operate on 128 bits at a time using the SIMD instruction set to obtain better performance.

• An SPE’s virtual registers (data buffers) are basically defined as vectors of character data type.

• The character vector registers are converted to the proper data type once the operation is known.

• A unary operation requires only one vector register which is used as a source and a destination.

• A binary operation requires two vector registers: a source and a destination.

• The PPE supplies the SPEs with the vector register numbers for arithmetic operations such as Add and Subtract.

• The PPE supplies the SPEs with a vector register number and a reference to a scalar for operations such as replicate.

• Virtual registers are cast to vectors of corresponding primitive data types, such as (vector float *), in order to use the intrinsic functions.

• Virtual registers must be 128 bytes long or a multiple of 128 bytes in length.

• The number of iterations, ITER, is an unsigned integer that equals REG_SIZE*sizeof(dataType)/128. The ITER variable determines the number of SIMD operations required.

• VSM_REG[] is an array of pointers that point to the aligned data buffers (vector registers) on the SPEs.

case ADDF: {
    vector dataType *aptr = (vector dataType *) VSM_REG[destReg];
    vector dataType *bptr = (vector dataType *) VSM_REG[srcReg];
    for (int j = 0; j < ITER; j++)
        aptr[j] = spu_add(aptr[j], bptr[j]);
    break;
}

Figure 5.16: Add Operation
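As an illustration of the register layout described in the list above, the SPE-side buffers might be declared along the following lines; the pool size, register count and initialisation routine are assumptions for this sketch, not the actual VSM declarations.

    #define NUM_VREGS 8                           // illustrative number of virtual registers
    #define VR_S      4096                        // illustrative register size: 4 KB per SPE
    static char vregPool[NUM_VREGS][VR_S] __attribute__((aligned(128)));
    static char *VSM_REG[NUM_VREGS];              // pointers to the aligned data buffers

    static void initRegisters(void)
    {
        for (int i = 0; i < NUM_VREGS; i++)
            VSM_REG[i] = vregPool[i];
    }

    // Once the opcode identifies the element type, a buffer is reinterpreted, e.g.:
    //   vector float *aptr = (vector float *) VSM_REG[destReg];
    //   unsigned int  iter = VR_S / sizeof(vector float);   // number of 16-byte SIMD words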

5.6.5.1 Add

The Add operation is a basic binary non-blocking operation. It takes two vector registers, adds the two vectors element by element and stores the result in the first register. Figure 5.16 shows a generic implementation of the Add operation.

5.6.5.2 Replicate

The Replicate operation is also a binary operation, but it takes one vector register and a scalar instead of two vector registers as in Add, Subtract, Multiply, etc. This operation basically takes a scalar value and copies it into each element of the vector register. Figure 5.17 shows the implementation of this operation. The major difference from standard binary operations is due to the scalar operand, as it needs to be handled differently from the vector registers. When an operation requires a scalar, the PPE sends only the main memory location of that scalar to the SPEs, which must subsequently take care of any alignment measures.

In Figure 5.17, the code for aligning a scalar value was placed in a separate function, called alignScalar(), just to make it clearer and more understandable. The alignScalar() function takes two main steps. First, it copies the aligned 128 bytes that include the location (MEM_E) of the scalar into the SPE local memory. It then adjusts the scalar pointer to point to the exact location of the scalar value that was copied into the local memory.

case REPF: {
    vector dataType *aptr = (vector dataType *) VSM_REG[destReg];
    dataType *bptr = (dataType *) alignScalar(MEM_E);
    for (int j = 0; j < ITER; j++)
        aptr[j] = spu_splats(*bptr);              // splat the scalar into every element
    break;
}

void *alignScalar(uint MEM_E)
{
    static char TMP[128] __attribute__((aligned(128)));   // must outlive the call
    unsigned int AB = MEM_E % 128;
    void *scalar __attribute__((aligned(128)));
    mfc_get(TMP, MEM_E - AB, 128, 1, 0, 0);
    mfc_write_tag_mask(1 << 1);
    mfc_read_tag_status_all();                    // wait for the DMA to complete
    scalar = (void *)((uint)TMP + AB);
    return scalar;
}

Figure 5.17: Replicate Operation

[Figure: the N elements of the vector are added pairwise; after each round the number of partial results is halved (N, then N/2, and so on).]

Figure 5.18: Pattern Of Reduction Operation

5.6.5.3 Vector Reduction Operation

The VSM also supports associative reduction operations on the SPEs. Reduction operations differ from other computational operations in several respects. They are blocking operations, unlike the other computational operations, and they also differ from the Replicate operation from the input/output perspective: a Reduction operation’s outcome is a scalar value, while the Replicate operation’s outcome is a vector register.

The VSM reduction operation is based on a recursive vector reduction method. Using standard scalar instructions and registers, the method starts with a vector of length N = VEC_SIZE and adds pairs of consecutive elements one after another, keeping the results in the same vector. The number of addition operations is N/2, and the number of elements is halved after each iteration. In the second iteration, the N/2 intermediate results are again reduced by half, to N/4 elements. The final scalar result is obtained after log2 N iterations or rounds.

122 5 Virtual SIMD Machine

However, our VSM depends on SPE intrinsic functions to carry out arithmetic operations, and therefore the number of consecutive elements that are added to each other at a time is REG_SIZE/sizeof(dataType), where REG_SIZE is the size of the physical machine’s registers in bytes. Figure 5.18 illustrates how the elements of a 32-bit floating-point vector are added together using 128-bit SIMD instructions; i.e., REG_SIZE = 128/8 = 16 bytes. Given that

N = VEC_SIZE * sizeof(dataType) / 16,

the number of operations needed at the first round (iteration) to evaluate a vector of VEC_SIZE elements of type dataType using 128-bit SIMD instructions is N/2, and the final scalar result is obtained after log2 N iterations or rounds.
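For instance, taking a 4 KB virtual register per SPE holding VEC_SIZE = 1024 single-precision floats (values chosen purely as an illustration): sizeof(dataType) = 4, so N = 1024 * 4 / 16 = 256 128-bit vectors; the first round performs N/2 = 128 spu_add operations, and after log2(256) = 8 rounds a single 128-bit vector remains, whose REG_LEN = 16/4 = 4 partial sums are then added serially before the result is returned to the PPE.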

However, if the SPE intrinsic functions are used, then there will be an intermediate result which resides in a vector of size

REG_LEN = REG_SIZE / Sizeof(dataType)

This intermediate result must then be added together element by element, as shown in the second loop in Figure 5.19, and then sent to the PPE. Figure 5.19 shows the implementation of the Reduction operation using the spu_add() intrinsic function; it uses the same variables as the discussion above. The code in this figure also shows how the result is sent back to the PPE using the mfc_put() function and how the SPE acknowledges the PPE via the outbound mailbox.

5.7 Using VSM

The VSM tool imitates SIMD concepts to increase the work a single instruction performs. The VSM implementation hides all the underlying hardware and software of the Cell processor, automatically parallelising array operations and supporting scalable parallelisation. It basically consists of two co-operative interpreters that are implemented in C++, each of which contains a number of functions and routines. The PPE functions have to establish the communication with an SPE, choose the proper operation to be performed on the SPEs, dispatch the request to the SPEs via messages, and handle any synchronisation required. On the SPE side, there is an SPE program which runs in the background and checks whether any message has been deposited into its inbound mailbox. Once an SPE receives a message, it starts dealing with the requested operation, taking into consideration any alignment and data synchronisation issues.

case REDFP: {   // REDUCTION OPERATION (dataType)
    vector dataType *aptr = (vector dataType *) VSM_REG[destReg];
    uint j = 0;
    while (N > 1) {
        for (uint i = 0; i < N; i = i + 2)
            aptr[j++] = spu_add(aptr[i], aptr[i+1]);
        N = N / 2;
        j = 0;
    }
    dataType *a = (dataType *) VSM_REG[destReg];
    sum = 0.0;
    for (uint i = 0; i < REG_LEN; i++)
        sum += a[i];
    mfc_put((void *)&sum, MEM_E, 128, 1, 0, 0);
    do {} while (!spu_readchcnt(SPU_WrOutMbox));
    spu_writech(SPU_WrOutMbox, (unsigned int)REDFP);
    break;
}

Figure 5.19: Vector Add Reduction Operation

spu-g++ SPEcode.cpp -lsimdmath -o SP
ppu32-embedspu speInterpreter SP spe.o
ppu32-g++ -c PPEcode.cpp -o ppe.o

Figure 5.20: Building VSM Object Files

VSM can be built using only three command lines, in which third-party C++ compilers for the PPE and SPE processors are used to compile and link both interpreters. Figure 5.20 shows these command lines. The first line compiles the SPE interpreter, SPEcode.cpp, into the SP binary. The second command embeds the SP binary file into the PPE-compatible object format spe.o; speInterpreter is the name used on the PPE to refer to the SPE binary. The last command compiles the master interpreter, PPEcode.cpp, and builds the object file ppe.o. The PPE and SPE object files are then ready to be linked with any program to evaluate array operations on the SPEs.

VSM was essentially designed to be used as an interface to extend array programming language compilers for automatic parallelisation, but it can be used as a fully implicit programming model for parallelising array operations on the SPEs.

5.7.1 Code Generator Interface

Developing the VSM as a programming tool that can assist compilers in parallelising code on heterogeneous architectures was the main objective of this work. Compiler developers can use VSM as an intermediate layer or interface to access the Cell’s SPEs. A compiler that attempts to use VSM is expected to have the capability and flexibility to generate code with a degree of parallelism that is already supported by the VSM implementation. If a compiler has such capability, it can automatically parallelise array expressions by generating code which invokes the VSM instruction set. This usually requires defining a new instruction set that the compiler code generator can use. The following chapter shall discuss how the VSM implementation was employed to extend the Glasgow Vector Pascal (VP) compiler to the Cell BE processor.

5.7.2 Application Programming Interface (API)

The VSM implementation can also be used as an API. A programmer can decompose a high-level array expression into sequences of individual operations and then call the proper VSM stub routines to perform one array operation at a time, in parallel, on the SPEs. Though the current VSM implementation has not yet been tuned for use as an API, it can easily be used by programmers. Figure 5.21 shows a simple C++ program, foo.cpp, that uses the VSM routines to perform some operations on 2 SPEs. The program was compiled, linked and made ready to execute using only the following command.

ppu32-g++ foo.cpp ppe.o spe.o -lspe2 -lm -o foo

Note that the program does not include any annotations or directives, and the programmer does not need to handle SPE thread creation, data partitioning, data transfers, alignment or synchronisation.

5.8 Experimental Results

We present here the results obtained from testing the VSM implementation as an independent programming tool. Chapter 8 includes further results which show the performance of the VSM on BLAS benchmarks and real-world applications after being integrated with the VP compiler. All the tests presented here were carried out on a Sony PlayStation 3 console, running Fedora Core 7 Linux and using the IBM Software Development Kit (SDK) v3.0.0. The PS3 has a single Cell chip. The chip includes a 3.2 GHz master processor with 256 MB of main memory and 6 SPEs, each with 256 KB of local memory and running at the same clock speed. The VSM implementation is compiled, assembled and linked using ppu-g++ and spu-g++ (GNU tool chain v4.1.1).


#include <stdio.h>
#include <stdlib.h>
#include "PPE.h"                         // Includes Functions Prototypes
const int N=4096;
float v1[N],v2[N],S;
int main()
{
    speInitialize();                     // Create, load & launch SPE Threads
    for (int i=0;i<N;i++) {              // Initialise the input vectors
        v1[i]=i; v2[i]=1.0f;
    }
    loadVec(0,v1);                       // VR0 <- v1
    loadVec(1,v2);                       // VR1 <- v2
    addVec(0,1);                         // VR0 <- VR0 + VR1
    storeVec(0,v1);                      // v1  <- VR0
    speEnd();                            // Terminate the SPE threads
    return 0;
}

Figure 5.21: A C Program Uses VSM as API

The following sections discuss the VSM key operations that we attempted to assess using two single-precision floating-point vectors. The discussion starts with a simple piece of code just to show how to invoke the four key VSM functions (operations): thread creation, Load, Store and a computational operation. In this discussion, we identify the main activities or parameters in these operations. After that, we discuss the performance of the different operations and the activities involved. Note that all the measurements were taken from the PPE.

5.8.1 Simulator

In order to use the VSM implementation before it was integrated with the VP compiler extension, we developed a simulator to imitate compiler-generated code. The simulator is a simple C program that uses the VSM as an API. The code in Figure 5.22, for example, shows the main part of a simulator that adds two vectors v1 and v2 and stores the result in vector v3. We also appended a few lines of code to the PPE functions to time the different activities of each VSM instruction, such as communication, computation and alignment. The appended code is not shown here since it involves only code for calculating execution times.

5.8.2 SPE Thread Creation

In order to use the SPEs, this operation must be called first. It involves three activities:


// Create and launch 2 SPE threads
speInitialize(2);
// Load vectors v1 & v2 from the PPE into SPE virtual registers VR0 & VR1
loadVec(0,v1);
loadVec(1,v2);
// Add the SPE virtual registers VR0 & VR1 and keep the result in VR0
addVec(0,1);
// Store the contents of VR0 into vector v3 (main memory)
storeVec(0,v3);
// Terminate the SPE threads
speEnd();

Figure 5.22: C Simulator

• Create an SPE thread.

• Load the SPE program.

• Launch the thread.

To measure the latency of creating one thread, we ran this operation to create first one thread on one SPE, then 2 threads on 2 SPEs and then 4 threads on 4 SPEs. Table 5.2 presents the average latency of the basic thread-management activities involved in creating threads on the Cell accelerators.

Number of       Average latency of managing SPE threads (msec)
Threads       Creating    Loading SPE    Launching    Average
   1            0.70         1.17           0.12        1.99
   2            0.65         1.05           0.09        1.78
   4            0.65         0.99           0.06        1.69

Table 5.2: The average latency of the basic thread management operations on the Cell accelerators

Figure 5.23 (a) shows that the average latency for setting up a single thread ranges from 1.68 msec to 1.98 msec, depending on the number of SPEs. The timing includes the three steps or activities, create, load and launch, as well as the time required to ensure that each activity has completed. A similar latency was also obtained by Abellan et al. [74], in whose experiments the operation was run 10^5 times.


[Figure: (a) latency per thread (msec) for 1, 2 and 4 SPE threads, broken down into create, load and launch; (b) share of the total set-up time: loading about 58.63%, creating about 36.63% and launching about 4.74%.]

Figure 5.23: (a) shows the latency per thread while (b) approximates the percentages of time spent in the activities involved in setting up a single SPE thread

Graph (a) plots the average latency for setting up one thread when 1, 2 and 4 SPEs were used. As we can see, the average time needed to set up one thread decreases as the number of SPEs increases. In fact, the average latency for creating or launching a single thread is almost the same even when the number of threads increases, but the latency for loading the code onto the SPEs decreases as their number increases. The interpretation is that there is some overlap in loading the SPEs’ programs, which saves some time.

Figure 5.23 (b) charts the percentage of the three activities as compared to the total average latency. This diagram clearly shows that loading the SPE code dominates the thread creation process: around 58.5% of the time is spent loading an SPE program, while creating an SPE thread takes approximately 36.5% of the total time. It is also interesting to notice that only about 4.75% of the time is needed to launch an SPE thread. This was explained in steps 4 and 5 in Figure ??: the PPE is only required to request (step 4) that an SPE’s MFC begin executing the loaded thread, and the MFC handles the request internally (step 5).

However, in order to reduce the thread creation latency, the compiler code generator can append a call to the function that creates the SPE threads in the prologue (start-up) section. It can also terminate the threads when the application finishes by appending a call to the terminating function in the epilogue section. This keeps the threads alive until the application has completed.

5.8.3 Messaging Using Mailboxes

The VSM depends mainly on the Cell’s mailbox mechanism, using MMIO registers to exchange information between the different cores. Every VSM instruction (PPE function) is required to communicate with the SPEs through their mailboxes in order to hand its request to the SPEs. We used this mechanism because all the available documents state clearly that the mailbox mechanism is much faster than using DMA transfers to exchange small data sizes [136, 54, 57, 135, 91, 56]. However, at the first stages of the VSM development, it was very hard to prove that the mailbox mechanism is faster than the DMA and Signal mechanisms.

At the beginning, we started investigating the latency of different operations and in particular the latency of sending a message between the PPE and an SPE. In our experiments, we used the spe_*_mbox* functions from the SDK library to access the mailboxes easily [135]. These SDK functions allow the PPE to read or write to a mailbox or check the status of a mailbox. Yet we encountered a critical problem, as all the experimental studies showed that the mailbox mechanism was very slow. It was sometimes even slower than DMA transfers if we compared, for example, the average time to send 32 bytes using a mailbox or a DMA transfer, especially if the latter was used to move a large data block. However, the resources which were available during the first stages of our VSM development reported that the PPE can use the SDK library functions or write immediately into the SPE’s MMIO registers, but they did not mention that the SDK functions are slower than the latter approach.

In April 2009, I attended a two-day training course at the Daresbury Science and Innovation Centre in Warrington, UK, and I had the chance to raise the problem with two of the manufacturer’s engineers, who recommended trying the MMIO registers directly. I then asked them whether they thought MMIO was much faster than the SDK functions. I also mentioned that all the measurements I had at that time showed that a mailbox message using the SDK functions costs around 0.01 millisecond. They said this was expected, but they did not know how much faster the MMIO register approach would be compared with the SDK functions. After the training course finished, I re-ran all the tests using MMIO registers instead, and surprisingly the tests showed that the MMIO mechanism is at least 12 times faster than the SDK functions. About the same speed-up was also reported by a paper published around the same period which characterises the basic communication operations using dual Cell-based blades [156]. The main reason behind the increased latency is that the SDK function is a runtime management function that involves a system call, which is quite costly.

It is important to explain here, before reflecting on the timing, the PPE procedure which was implemented to send a message to one or more SPEs. To send a message through the MMIO registers, the PPE must first check the number of entries available in the target mailbox, because it will not stall even if the mailbox is full; it simply overwrites the last message. If more than one SPE is used, there are two procedures for doing this. In the first procedure, the PPE checks the mailboxes one after another, and it does not go from one SPE to the next until it has sent the message to the current SPE. The second procedure is meant to reduce the time spent waiting for each SPE’s mailbox to be empty. In this procedure, the PPE uses a polling technique to check whether the first SPE’s mailbox has enough space to hold the messages. If the checked SPE has enough space, the PPE posts the message(s) to it; otherwise the PPE moves on to the next SPE’s mailbox, and so on. That is, the PPE does not have to wait for free entries in a given mailbox; instead it skips the currently checked mailbox for another SPE whose mailbox is free, until all the SPEs have received the messages. This procedure is used by all VSM operations, such as load, store and computational operations, to send the PPE’s request to the SPEs, and therefore the cost of sending a message is the same in all operations.

[Figure: average latency (nsec) of sending one or two 32-bit messages to 1, 2 and 4 SPEs.]

Figure 5.24: The Latency of Sending Messages to SPEs

This operation was run 10^5 times to send one 32-bit message and two 32-bit messages to 1, 2 and 4 SPEs. Figure 5.24 shows the average latency of sending messages in nanoseconds. The time to send a one-word or two-word message from the PPE to one SPE is almost the same; it is around 780 nsec. To send the same message to two SPEs, the latency increases by about 90 nsec as compared to one message, and by another 85 nsec when 4 messages are sent to 4 SPEs. The additional checks of each SPE’s mailbox status and the loop iterations needed to go over the SPEs are behind these increases.

5.8.4 VSM Instructions Performance

The following discussion focuses on three key instructions or operations: load, store and computation operations, and all the performance tests were carried out on single-precision floating point arrays using a virtual SIMD register of size 4 KB per SPE.


The main parameters in regard to these operations are:

• Load and Store operations move data between the PPE and SPEs using DMA trans- fers.

• All DMA data transfers are issued by SPEs.

• Load and Store operations can handle unaligned data automatically.

• Load and all computational operations except reduction are non-blocking operations. This implies that once the PPE delivers its message(s) to an SPE, it can move on to the next instruction.

• Store is a blocking operation which implies that the PPE will deliver its message and stall until the requested operation is completed.

• The main activities involved in most of the VSM instructions are sending messages from the PPE and processing the request on the SPEs.

All the values and figures in this section time the main task of each operation independently, excluding any conventional set-up such as declarations or initialisations, and all the measurements were taken by the PPE. Table 5.3 presents the average time for loading, storing and performing arithmetic operations on a block of 4096 single-precision floating-point (32-bit) elements on 1, 2 and 4 SPEs. Each operation was run 10^6 times using a virtual register of size 4 KB per SPE. Load and store are byte-based operations, and therefore the data type does not affect the latency of these memory-access operations.

                                Time (seconds)
Operation                 1 SPE      2 SPEs     4 SPEs
Unaligned Load            1.581      1.178      0.904
Arithmetic Operations     1.499      1.101      0.833
Aligned Store             4.550      2.401      1.500
Unaligned Store          15.073      7.647      4.036

Table 5.3: The average time of processing a block of 4096 single-precision floating-point (32-bit) elements using a virtual register of size 4 KB per SPE. Each operation was run 10^6 times.

Figure 5.25 shows the average latency per 32-bit word (single-precision floating point). The figure shows that the average latency of the load and arithmetic operations per word varies between 0.2 nsec and 0.4 nsec, depending on whether one or four SPEs were used. These two instructions are non-blocking operations, yet the latency of the load operation, as we can see, is slightly higher than that of an arithmetic operation. This variation is a result of loading unaligned data. Loading unaligned data does not cost as much as an unaligned store because, if the data is unaligned, an SPE only needs to load an additional 128 bytes and does not need to use any locks. More discussion on this point can be found in subsection 5.6.3.

[Figure: average latency per 32-bit word (nsec) on 1, 2 and 4 SPEs for unaligned load, arithmetic operations, aligned store and unaligned store.]

Figure 5.25: The average latency time for processing a 32-bit word (single precision floating-point) using a virtual register of size 4 KB per SPE.

The store instruction's design, on the other hand, differs from the load's design. Handling alignment on the SPE for a Store is more costly than for a Load because it requires three DMA transfers and two locks to handle data synchronization. Also, the store operation is a blocking operation, which means that the SPEs have to communicate with the PPE using mailbox messages (acknowledgements) once the data has completely left the SPE local memory, and thus no overlapping is allowed as in Load. Moreover, the PPE is required to call the __lwsync() function to ensure that the data is completely resident in main memory before executing the next instructions. These factors explain why storing unaligned data costs up to 10 times as much as an unaligned load when one SPE is used, and up to 5 times (1 nsec) when using 4 SPEs, as shown in Figure 5.25. The same figure also shows that the average cost of storing an aligned 32-bit word is reduced by a factor of 3.5 as compared to storing an unaligned 32-bit word. However, store is not as frequent an instruction as load and computation instructions, and thus its relative cost can be reduced by combining as many operations as possible per array expression.

Figure 5.25 also points out that blocking instructions, such as Store, are almost fully scalable because the opportunity to overlap communication and data transfers is high; this can be seen clearly as the store instruction latency was reduced by a factor of P^0.94, where P is the number of processors. On the contrary, the cost of non-blocking instructions is reduced only by a factor of roughly P^0.42. Non-blocking instructions are not fully scalable because their low latency does not allow a time window in which to overlap communication and reduce the cost.

Figure 5.26, which is a graphical presentation of the figures given in Table 5.3, shows that the Load and arithmetic operations have almost the same latency because they are non-blocking operations, which means that the reported costs of these operations represent only the cost of exchanging messages between the PPE and SPEs, as explained in Section 5.6.3.


Figure 5.26: The time for processing a block of data on 1, 2 and 4 SPEs. Each operation was measured using a block of 4096 single precision floating-point values and a virtual register of size 4 KB per SPE, and was run 10^6 times.


Figure 5.27: (a) The percentage of time spent on the main activities out of the total latency time of each operation. (b) The percentage of time spent on processing aligned data as compared to unaligned data.

The Store operation's latency is relatively high even on aligned data because it is a blocking instruction. However, the average cost of each operation decreases as the number of SPEs increases. The diagram in Figure 5.26 shows that the time for storing unaligned data, using the same parameters as above, was reduced from around 15 seconds when one SPE was used to only about 4 seconds when 4 SPEs were used. The time was thus reduced by a factor of 3.75 when storing unaligned data, and a similar reduction was achieved when storing aligned data. The improvement in the Load and arithmetic operations when more SPEs were used is relatively small compared with the Store operation, because there is only a narrow window of time in which message exchanges between multiple SPEs can be overlapped.

The charts (a) and (b) in Figure 5.27 show the percentage of time taken by the activities involved in the load, store and add operations out of the total latency of each operation. The activities involved in each instruction are sending a request (messages) from the PPE to the SPEs and processing the request on the SPEs.

Figure 5.27 (a) shows that the load and add operations each spent an average of 65% of their total time on communication, or sending messages, while the store operation spent about 16% on sending a message from the PPE to the SPEs. The variation here is due to the total latency of the store operation, as the percentages were calculated relative to the total latency of each operation.

Figure 5.27 (b) plots the percentage of time spent on processing aligned data as compared to processing unaligned data of the same size and type. Let us first comment on the computational operations. The processing time may vary from one computational operation to another, but it should not be affected by the alignment constraints, as we can see in this figure, because all SPE virtual registers are aligned data buffers and consequently all computational operations are performed on aligned data.

Figure 5.27 (b) shows that the latency of loading unaligned data is slightly higher than that of loading aligned data. The difference is fairly small because loading unaligned data on an SPE requires loading only an additional 128 bytes and no locks are used. On the store operation, however, we can see a large improvement on aligned data: the time drops from around 83% to about 65% of the total latency of the operation, because if the data is aligned the SPE does not need to bring data back and forth between main memory and local store, as explained in the Store Processing Algorithm presented in subsection 5.6.4.1.

6 Host Compiler Development

The VP front-end compiler is already implemented, but it needs very slight changes to handle machine-dependent features. There are also several back-end compilers that have already been implemented for different architectures, but not for the PowerPC architecture [113, 141, 142]. Therefore, it was necessary to develop a sequential back-end compiler to port VP to the Cell's master processor and then extend it to collaborate with the VSM model to access the Cell's SPEs. The Cell's host processor belongs to the PowerPC architecture family. PowerPC architectures have specific register conventions and do not have dedicated registers such as a stack pointer and a frame pointer. They also do not support a number of basic machine instructions, such as stack PUSH and POP, or instructions that handle the function-calling mechanism, like the Intel ENTER and LEAVE instructions. These are the main issues that must be resolved explicitly by PowerPC compilers to ensure the compatibility of the different modules of a program and to be able to combine these modules as one unit.

VP back-end compilers basically consist of two parts: a target machine description and machine-dependent routines. The machine description defines the resources of the target machine, such as machine instructions, memory, registers and addressing modes, while the machine-dependent routines handle machine-dependent features such as register conventions and stack operations. The machine description is coded in the Intermediate Language for Code Generators (ILCG) notation, and the machine-dependent routines are written in Java. The implementation also includes a number of assembly macros to handle instructions that are not offered by the PowerPC assembly instruction set, such as loading a 32-bit immediate value into a register. The sequential version of the compiler is then extended to collaborate with VSM for parallelising array expressions. The tools used in the implementation are GNU tools such as the Java compiler and the assembler. The macro processor m4 was also used as a text replacement tool to unfold machine specification templates.

However, during the PowerPC back-end compiler development, I developed a new approach to optimising the compiler code generator. This approach is based on genetic algorithm techniques to automatically optimise the ordering of the instruction set, instead of the manual approaches which depend on feedback procedures to enhance the ordering of instructions. The discussion of the GA optimiser, however, is deferred to the next chapter.

This chapter covers the implementation of the standard (sequential) PowerPC back-end compiler and its extended version. The compiler was extended to work together with the VSM model as a single compiler system.

6.1 PowerPC Machine Description

The PowerPC machine instruction set was described using the ILCG notation. The implementation includes declarations and definitions of the available data types, physical machine registers, operators and instruction patterns. The following are some samples of the basic machine specification.

• Data types declaration

type int8=BYTE;
type int16=WORD;
type int32=DWORD;
....

• Chain Type definition

pattern signed means [int8|int16|int32|int64];
pattern word64 means [int64|uint64];
pattern double means [ieee64];
.....

The means portion is used by the unification algorithm to match against an abstract syntax tree node.

• Registers declaration

– Regular Registers


register int64 R3 assembles['r3'];
register int64 R4 assembles['r4'];
...
register int64 R31 assembles['r31'];

The assembles clause determines what assembly code is to be generated.

– Reserved Registers

reserved register int64 R0 assembles['r0'];
reserved register int64 R1 assembles['r1'];
...

– Alias Registers

/* Stack Pointer (SP) */
alias register word64 SP=R1(0:63) assembles['r1'];
/* Frame Pointer (FP) */
alias register word64 FP=R2(0:63) assembles['r2'];

Once registers are defined, they should be grouped in sets. For example, the following code illustrates how registers, R0,R1,... Rn, are grouped under the siregs pattern.

pattern siregs means[R0|R1|R2|R3|R4|...| Rn];

• Referencing Indirect Memory

pattern naddrmode (nsrc r) means[ mem(^(r))] assembles [r];

• Operators Definition and Grouping


operation add means + assembles ['add'];
operation gt means > assembles ['gt'];
operation eq means = assembles ['eq'];
pattern oper means [add|mul|dv];
pattern logicoper means [and|or|xor];
pattern cond means [lt|gt|eq|ng|le|ge|ne];
pattern immediate means [int8|int16];

• Patterns definition of memory operations

pattern EA(reg r) means [^(r)] assembles['('r')'];
pattern lbl(label l) means [l] assembles [l];

• Stack Operations

/* Increment the SP by the offset i */
instruction pattern INCSP(imm i)
    means[ R1 := +(^(R1),i) ]
    assembles['addi r1,r1,'i];

/* Push register r on the stack and then update r1 (SP) */
instruction pattern PUSHREG(reg r)
    means[PUSH(mainSTACK,^(r))]
    assembles['std ' r ', 0(r1)' '\n addi r1,r1,-8'];

/* Update r1 (SP) and then pop register r from the stack */
instruction pattern POPREG(reg r)
    means[r:=(int64)POP(mainSTACK)]
    assembles['addi r1,r1,8 \n ld 'r', 0(r1)'];

• Memory access operation


/* Load Instruction */
instruction pattern LOAD(reg r1, reg r2, reg r3)
    means [r1:=(int64)mem(+(^(r2),^(r3)))]
    assembles ['ldx ' r1 ',' r2 ',' r3 ];

/* Store Instruction */
instruction pattern STORE(maddrmode rm, reg r1, word64 t)
    means [ (ref t) rm := ^(r1) ]
    assembles ['std 'r1','rm];

• Branch Instructions

instruction pattern IF(cond c, reg r1, reg r2, lbl l)
    means [if( c((int32)^(r1), (int32)^(r2))) goto l]
    assembles['cmpw ' r1 ',' r2 '\n b'c ' ' l];

instruction pattern IFI(lbl l, reg r1, imm s, cond c, int b)
    means[if((b)c((t) ^(r1),const s))goto l]
    assembles['cmpwi ' r1 ', ' s '\n b'c ' ' l];

• Arithmetic Operations

instruction pattern Op(oper op, reg r1, reg r2, reg r3, int t)
    means[r1 := (t)op( (t)^(r2),(t)^(r3))]
    assembles [op' 'r1','r2','r3];

instruction pattern MOD2(reg r)
    means [r:=MOD(^(r),2)]
    assembles ['andi 'r','r',1'];

• Other Operations


instruction pattern LIMM(imm i, reg r)
    means [r:=const i]
    assembles ['li ' r ',' i ];

• Defining Instruction Set

The final step in the machine description is to list all the instructions that will be used to match against the ILCG source code. The following ILCG code shows how the above instructions are ordered in a list named PPC-Instr:

define(PPC-Instr,INCSP|PUSHREG|...|STORE|IFI|IF|LIMM|OP|MOD2)

This list of all the instruction and operation patterns must be placed at the end of the machine description file.

6.2 Machine-dependent Routines

The automatically produced methods in PPC.java were extended in a new class, called PPCCG. The extension includes the following methods.

• cgApply():

This routine's task, in general, is to arrange argument passing based on the PowerPC ABI and generate the corresponding ILCG notation. It receives a node of type procedure and starts by getting a rough estimate of the number of registers needed to pass the arguments of that procedure. Based on this information, it then takes the following steps to generate the required ILCG code (a simplified sketch of this policy follows the list of steps):

– Determine the number of registers needed to pass the arguments. This must adhere to the PowerPC ABI shown in 3.4.

– Put the first parameters into the argument registers and reserve the used regis- ters.

– If the available registers are not enough to hold all the parameters, it then calls the cgPushit() routine to push the remaining arguments on the stack and updates the stack pointer.
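As a rough illustration of this policy, the following self-contained Java sketch assigns the first arguments to the PowerPC integer argument registers (r3 upwards) and spills the rest to the stack. The class and method names, the register limit of r10 and the emitted pseudo-assembly are illustrative assumptions only; this is not the actual PPCCG code.

import java.util.List;

public class ApplySketch {
    // First and last PowerPC integer argument registers (r3..r10).
    static final int FIRST_ARG_REG = 3, LAST_ARG_REG = 10;

    /** Emit toy assembly passing each argument either in a register or on the stack. */
    static void cgApply(List<String> argValues) {
        int reg = FIRST_ARG_REG;
        int offset = 0;                                        // stack offset for spilled arguments
        for (String v : argValues) {
            if (reg <= LAST_ARG_REG) {
                System.out.println("li r" + reg++ + ", " + v); // argument in a register
            } else {
                offset -= 8;                                   // one 64-bit stack slot
                System.out.println("li r0, " + v);             // scratch register
                System.out.println("std r0, " + offset + "(r1)"); // spill (the cgPushit role)
            }
        }
    }

    public static void main(String[] a) {
        cgApply(List.of("10", "20", "30"));
    }
}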


• cgPushit()

It generates the code required to allocate parameters on the stack. It receives a node which needs to go on the stack and returns an offset from the stack pointer. The main steps in this routine are:

– Gets the length of the node in bytes

– Casts the node to a reference as it will be located on the stack.

– Reserve the proper space on the stack, taking into account any alignment constraints when updating the offset.

• cgReturn()

This method determines the proper register to be used to return values. PowerPC designates different registers for different data types, and thus the routine first determines the data type to be returned and then selects the register accordingly, as sketched below.
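A minimal sketch of that selection, assuming the usual PowerPC convention of returning integers and pointers in r3 and floating-point values in f1; the class, method and type names are illustrative, not the actual PPCCG code.

public class ReturnRegSketch {
    /** Pick the return register for a result type: floats in f1, everything else in r3. */
    static String returnRegister(String type) {
        switch (type) {
            case "ieee32":
            case "ieee64":
                return "f1";   // floating-point results
            default:
                return "r3";   // integer and pointer results
        }
    }

    public static void main(String[] args) {
        System.out.println(returnRegister("int64"));   // r3
        System.out.println(returnRegister("ieee64"));  // f1
    }
}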

• cgProcedure()

This is a focal function that examines the type of a given procedure, establishes a procedure signature and calls the proper routines to generate assembly code that corresponds to the source code. This routine takes two arguments: an object of type procedure and another of type Walker. The Walker class provides generic utility methods that can be used by code generators, and if the code generation process succeeds, the routine returns true. The main steps in this routine are:

– Determine whether the procedure is external (imported) and, if so, add the proper assembly directive.

– Plant the directive to identify the start of the procedure.

– Generate a label to the procedure and add it to the code.

– Call the ENTER() function and pass to it the lexical level of the procedure to establish a new execution environment for the called procedure.

– Call the code generator routine on the class Walker. The routine walks through the syntax tree and matches it against the machine description. This method generates a stream of machine code instructions that correspond to the source code in the body of the procedure.

– Calculate and set the size of the stack frame needed.

– Call the LEAVE() function to produce the code for exiting a pascal procedure safely.

6.3 Assembly Macros

Besides the Java machine-dependent routines, assembly macros were implemented to handle undefined operations and functions such as loading 32-bit and 64-bit immediate values (constants), trigonometric functions and data alignment. The following discussion looks at three samples from the implemented macros.

• Loading 32-bit Immediate Value

We had to implement this assembly macro to allow loading a 32-bit immediate value into a register because the PowerPC architecture offers only 16-bit immediate instructions.

.macro loadintr value,reg
.section .data
1: .long \value
.section .text
    lis 27,1b@ha
    addi 27,27,1b@l
    lwa \reg,0(27)
.endm

The first line defines the macro's name and the parameters it takes. It is called loadintr and takes two parameters (value and reg); it can be used to load a 32-bit integer value into a register. The directive .section .data on the second line indicates that what follows is data, not code. The assembly statement in the data section allocates a 32-bit (long) word in memory, labelled 1:, and initialises it with the constant value. The label 1 is a local symbol that can be used to reference that memory location. The directive .section .text signals to the assembler that what follows is code, not data. The first two assembly instructions load the address of location 1. That is, the instruction lis loads the upper 16 bits of the address into register 27, shifted to the left, and the instruction addi adds the lower 16 bits into the same register 27. The symbol 1b in 1b@ha and 1b@l refers to the most recently defined (b for backward) label numbered 1, while @ha and @l select the high and low 16 bits of its address respectively. The last instruction, lwa, loads a 32-bit integer word from the address held in register 27 into the given register reg. The .endm directive ends the macro.


• Trigonometric Functions

The following macro shows an example of how the trigonometric functions are handled using assembly code.

.macro fsins f
.section .text
    fmr 1,\f
    bl sin
.endm

This macro takes one argument, a floating-point register f. The instruction fmr 1,\f moves the contents of register f into floating-point register FPR1 where, according to the PowerPC ABI, it will be passed to the following instruction bl sin, which calls the sine function.

• Align Load Instructions

Because PowerPC architectures use load instructions that are 32-bit word-aligned, the following macro handles loads from addresses that may not be word-aligned.

.macro unalignload reg base offset
.section .text
.if (\offset & 3) != 0
    li 30,\offset
    andi. 30,30,3
    neg 30,30
    addi 30,30,4
    addi 30,30,\offset
    add 30,30,\base
    lwz \reg,0(30)
.else
    lwz \reg,\offset(\base)
.endif
.endm


Action                                             Intel               PowerPC
1- Push the frame pointer onto the stack           push(EBP)           stw r31,0(r1)
                                                                       addi r1,r1,-4
2- Maintain a copy of the SP                       Temp ⇐ ESP          la r30,0(r1)
3- Check nested level (NL);                        IF (NL > 0)         L1: cmpwi NL,0
   copy previous frame pointers in display area    FOR i ⇐ to (NL-1)   ble L2
3.1- Move FP to previous frame location            EBP ⇐ EBP - 4       addi r31,r31,-4
3.2- Store previous nested FP in display area      push(EBP)           stw r31,0(r1)
                                                                       addi r1,r1,-4
3.3- Update the level                                                  addi NL,-1
3.4- Go back to step 3.1                           ENDFOR              b L1
4- Store the original SP onto the stack            push(Temp)          L2: stw r30,0(r1)
                                                   ENDIF
5- New FP points to current SP (r30 in step 2)     EBP ⇐ Temp          mr r31,r30
6- Compute new frame size FS and update SP         ESP ⇐ EBP - FS      addi r1,-FS(r31)

Table 6.1: Emulating Intel ENTER instruction on PowerPC

6.4 Stack Frame Operations

Since PowerPC architectures do not provide ENTER (prologue) and LEAVE (epilogue) instructions, it was necessary to implement these operations as methods of the class PPCCG.

• ENTER()

This routine creates a stack frame, updates the SP, updates the frame pointer FP, and links nested procedures if there are any. It must be called whenever a procedure or function is called. Some architectures, such as Intel, offer hardware support for such an operation, but the PowerPC does not. Table 6.1 presents an emulation of the Intel ENTER instruction for the PowerPC in assembly language.

The PowerPC assembly instructions in the third column of Table 6.1 are what the implementation of the ENTER() function is expected to generate.
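As a rough, self-contained sketch of such a prologue generator, the following Java method prints the PowerPC column of Table 6.1. It is illustrative only: it writes to standard output rather than to the compiler's Walker buffer, it unrolls the display-copy loop because the lexical nesting level is known at compile time (the table shows the equivalent run-time loop), and it writes the final stack adjustment in the common addi form.

public class EnterSketch {
    /** Emit a toy version of the prologue in the PowerPC column of Table 6.1. */
    static void enter(int nestingLevel, int frameSize) {
        // 1- push the caller's frame pointer
        System.out.println("stw r31,0(r1)");
        System.out.println("addi r1,r1,-4");
        // 2- keep a copy of the current stack pointer
        System.out.println("la r30,0(r1)");
        // 3- copy the enclosing frame pointers into the display area
        for (int i = 0; i < nestingLevel; i++) {
            System.out.println("addi r31,r31,-4");
            System.out.println("stw r31,0(r1)");
            System.out.println("addi r1,r1,-4");
        }
        // 4- store the original stack pointer
        System.out.println("stw r30,0(r1)");
        // 5- the new frame pointer is the saved stack pointer
        System.out.println("mr r31,r30");
        // 6- reserve the new stack frame (SP = FP - frame size)
        System.out.println("addi r1,r31,-" + frameSize);
    }

    public static void main(String[] args) {
        enter(1, 64);
    }
}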

• LEAVE()

The following Java code shows the implementation of the leave() function, which releases an active stack frame and restores the return address.


private void leave(Walker w) {
    w.buf.writeln(Epilog);
    /* load the return address into r0 and move it to the link register */
    w.buf.writeln("ld r0, 16(r1)");
    w.buf.writeln("mtlr r0");
    /* update the SP to point to the previous frame */
    w.buf.writeln("addi r1,r1," + frameSize);
    /* branch to the location in the link register */
    w.buf.writeln("blr");
}

6.5 Compiler Building Process

This section documents the main steps involved in the development process of the PowerPC back-end compiler, along with the tools used in each step:

• Define the target machine registers, instruction patterns, and operations and save them into a macro file called PPC.m4.

• Transform PPC.m4 into PPC.ilcg format using the m4 macro processor.

m4 PPC.m4 PPC.ilcg

• Translate the ILCG code PPC.ilcg into Java methods using the ILCG code-generator generator, and keep the generated code in the file PPC.java.

ilcg.ILCG PPC.ilcg PPC.java

• Compile the generated Java methods to produce the corresponding class files.

javac PPC.java

• Compile the machine-dependent routines in the class PPCCG.

javac PPCCG.java

• Aggregate all the produced classes including the front end classes into one Java archive file. This is the last step in the compiler building process.

6.6 PowerPC Compiler Extension

This section discusses the development of the machine specification extension. The work involved extending the machine description of the PowerPC back-end compiler to be capable of collaborating with the VSM model. This includes defining a virtual SIMD register set and introducing a new instruction set that operates on these virtual registers. I also had to modify some machine-dependent routines, such as ENTER and LEAVE, to create and terminate threads.

6.6.1 Packed Virtual SIMD Registers

The virtual vector (large) registers are designed to be used by virtual SIMD instructions which imitate SIMD instructions on the SPEs. The length (VECLEN) of these packed registers can vary, and I experimented with register lengths between 1024 and 16384 bytes.

The register file in this extended compiler consists of 8 vector registers, labelled NV0 to NV7. The following ILCG code, which declares the single-precision floating-point NV register set, illustrates both the machine semantics and the specification system:

/* Machine Description of the Virtual SIMD Machine Registers */
define(VECLEN,1024) /* this is an experimental parameter */
register ieee32 vector(VECLEN) NV0 assembles[' 0'];
register ieee32 vector(VECLEN) NV1 assembles[' 1'];
register ieee32 vector(VECLEN) NV2 assembles[' 2'];
register ieee32 vector(VECLEN) NV3 assembles[' 3'];
register ieee32 vector(VECLEN) NV4 assembles[' 4'];
register ieee32 vector(VECLEN) NV5 assembles[' 5'];
register ieee32 vector(VECLEN) NV6 assembles[' 6'];
register ieee32 vector(VECLEN) NV7 assembles[' 7'];
/* Groups vector registers under type nreg */
pattern nreg means[NV0|NV1|NV2|NV3|NV4|NV5|NV6|NV7];

The above ILCG defines each virtual register NVi as a 1024-element vector register. These registers can be used to generate a set of RISC-like register load, operate and store instructions. A packed virtual register can be processed by one SPE or by multiple SPEs. If a single SPE is used, the entire contents of each virtual register (NV) are copied into an aligned buffer on that SPE. If multiple SPEs are used, each virtual register (NV) is distributed equally over the available SPEs. For example, if two or four SPEs are used, each SPE holds a half or a quarter of the register respectively. The partitioning process is the responsibility of the PPE stub routines, which must compute the starting addresses and broadcast the VSM instruction to each active SPE.

Each SPE can then process its portion of the register in parallel with the other SPEs. To illustrate how the starting addresses are computed, consider the following simple example. Suppose that each virtual register can hold 1024 floating-point values and that 4 SPEs are used; that is, each SPE operates on 256 elements. Now, if the starting address to be accessed in main memory is 0x000A0000, the responsible PPE routine should broadcast the following starting addresses to the four SPEs (a small sketch of this computation follows the list):

000A0000 → SPE0   ; start address of the vector
000A0400 → SPE1   ; offset of 256 floating-point values from the start
000A0800 → SPE2
000A0C00 → SPE3
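A minimal Java sketch of this address computation, assuming 4-byte (ieee32) elements and an even split across the SPEs; the class and method names are illustrative and are not part of the actual PPE stub routines.

public class PartitionSketch {
    /** Compute the main-memory start address each SPE receives for its slice
        of a virtual register, assuming 4-byte elements split evenly. */
    static long[] sliceAddresses(long base, int veclen, int spes) {
        long sliceBytes = (long) (veclen / spes) * 4;   // bytes handled by each SPE
        long[] addr = new long[spes];
        for (int i = 0; i < spes; i++) {
            addr[i] = base + i * sliceBytes;
        }
        return addr;
    }

    public static void main(String[] args) {
        for (long a : sliceAddresses(0x000A0000L, 1024, 4)) {
            System.out.printf("%08X%n", a);   // prints 000A0000, 000A0400, 000A0800, 000A0C00
        }
    }
}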

With regard to the number of registers: to evaluate an array expression entirely within registers, the maximum number of registers the evaluation may require is equal to the number of operands in the expression, so in the worst case the compiler can evaluate an array expression comprising 8 operands. However, this is often not the case because, after each intermediate primitive binary operation, as in VP, one of the registers can be reused. To illustrate this, consider the following simple example, which shows how to approximate the number of registers required for the evaluation of a general expression built from binary operations:

Consider the arithmetic expression

A := B*(C-D) + Z/Q, where A, B, C, D, Z and Q are vectors. In order to evaluate this expression, the compiler should generate code such as:

R0 ← Z
R1 ← Q
R0 ← R0 / R1
R1 ← C
R2 ← D
R1 ← R1 - R2
R2 ← B
R1 ← R1 * R2
R0 ← R1 + R0
R0 → A


The above example shows that an expression with 4 operators and 5 operands requires a maximum of 3 registers. Ershov in 1958 proposed an approach that can be used to find an optimal number of registers required for the evaluation of an expression of primitive binary operators [157]. Ershov suggested that if an expression contains n binary operators and n + 1 operands, then the number of registers NReg needed for the computation is NReg ≤ log2(n+1) [157, 158]. Accordingly, a set of 8 virtual registers would be enough to evaluate one very long expression at a time.
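Applying this bound to the example above, and rounding the logarithm up since register counts are integers, gives a figure consistent with the three registers used in the generated sequence:

n = 4  ⇒  NReg ≤ ⌈log2(4+1)⌉ = ⌈2.32⌉ = 3.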

The virtual register length obviously has an influence on how an array expression is handled if the arrays being operated on in the source code are bigger than the virtual registers. In fact, this is one of the reasons that motivated us to choose Vector Pascal, because its front-end compiler is already designed to handle such situations when dealing with SIMD instruction sets. If the length of the arrays is an integer multiple of the register size, the front-end compiler can easily unroll the entire expression into multiple successive calculations on sub-arrays of length equal to the virtual register length. For example, if an array expression contains arrays of size 2048 and the virtual registers are of length 1024, the compiler would unroll the expression into two successive calculations. However, if the array size is larger than the virtual SIMD register but not an exact multiple of it, the compiler should generate virtual SIMD instructions to handle the portion of the array that fits in one or more virtual registers, and should also generate PPE scalar instructions to handle the remainder.

6.6.2 Virtual SIMD Instruction Set

As was mentioned in the previous chapter, the VSM offers PPE stub routines, or functions with two arguments, to support primitive operations. The main task of most PPE routines is to dispatch a request to the SPE(s) to perform a given operation. The idea here is therefore to introduce new machine instructions, called Virtual SIMD Instructions (VSIs). VSIs use a two-address, register-to-register format. They are a set of RISC-like register load, operate and store operations. The VSI set supports basic operations, such as Load, Store, Add, Sub, Sqrt, etc., in a mapped fashion. The compiler's code generator should see the VSIs as a SIMD-like instruction set with a high degree of parallelism. VSIs deploy the proper information in the right registers and invoke the right PPE stub routine. The role of the VSIs ends at this point, because it is the responsibility of the PPE interpreter to communicate with the SPE(s) and to handle other parallelisation issues such as data partitioning and synchronisation.

Semantics of some samples of the Virtual SIMD Instructions and how they were mapped into machine code are given in the following subsections.


6.6.2.1 Virtual SIMD Load and Store Instructions

The virtual SIMD load and store instructions can be used to move data within the Cell processor. From the PPE interpreter's point of view, these operations are carried out using underlying DMA operations. A DMA transfer, as mentioned in the introduction to the Cell, is basically a function with several parameters, of which the most important are the local store address and the main memory address. From the code generator's point of view, these virtual SIMD instructions simply need to call the right PPE stub routine, provided that they place the required information in the proper registers. Consequently, the implementation of these instructions must adhere to the PowerPC ABI shown in 3.4 by using the registers r3 and r4 to pass this information. The first register should be loaded with an integer n which determines the virtual register number to be used. The second register should be loaded with the starting address of the data to be moved from/to main memory. This implies that the implementation of each instruction requires two assembly instructions to set these two parameters and a third assembly instruction to call a PPE function for the particular operation. The following ILCG code shows the Load and Store instructions:

instruction pattern LOADFLT(naddrmode rm, nreg n)
    means[ n := (ieee32 vector(VECLEN))^(rm)]
    assembles['li 3, ' n '\n la 4,0(' rm ')' '\n bl LoadVec'];

instruction pattern STOREFLT(naddrmode rm, nreg n)
    means[(ref ieee32 vector(VECLEN)) rm := ^(n)]
    assembles['li 3, ' n '\n la 4,0(' rm ')' '\n bl StoreVec'];

Notice here that n is a variable of type nreg, which was defined above. The code generator will select the register (n) to be used from the registers (NV0–NV7) listed under nreg; the selection depends on the available registers. LoadVec and StoreVec are the PPE functions which handle the load and store of a floating-point vector of size VECLEN.


DMA transfers also require the size of the data to be transferred. This task is left to the PPE functions to determine based on the data type.

6.6.2.2 Virtual SIMD Computation Instructions

Computation operations can be classified based on the number of operands (arity) that an operation takes. There are binary operations, such as add, multiply and replicate, and unary operations, such as sqrt, sin and cos. To set up the environment for executing an operation on the SPE, a computational virtual SIMD instruction first loads the required information, which identifies the operands involved in the operation, into the proper registers according to the PowerPC ABI shown in 3.4, and then calls the right PPE stub routine. For example, a binary operation such as Add requires two integer values which represent the numbers of the two virtual vector registers that it operates on. In this case, the machine's physical registers r3 and r4 can be used to pass these parameters. The operands of other binary instructions are sometimes of different register types. For example, the replicate instruction operates on a scalar value and a vector register into which the scalar will be replicated; thus the scalar needs a physical machine register while the second operand needs a virtual SIMD register. In unary instructions, however, the virtual SIMD register, which is the only operand, is used as both source and destination.

• Instructions of Binary Operations

We present here two examples of binary operations which were implemented differently. In the first example, the operation involves two arrays, and thus the PPE function requires two parameters, say integers n and m, to determine the virtual vector registers to operate on. Accordingly, the ADDFLT instruction, given as an example here, must load n and m into the PPE general-purpose registers r3 and r4 respectively before it calls the PPE function AddVec. The implementation of the ADDFLT instruction is shown in the following ILCG code:

instruction pattern ADDFLT(nreg n, nreg m)
    means[n := +(^(n),^(m))]
    assembles['li r3, ' n '\n li r4, ' m '\n bl AddVec'];


The first line defines the instruction ADDFLT, which takes two parameters n and m of type nreg. The nreg pattern represents, as defined above, a group of virtual SIMD registers. The second line specifies the semantics of the instruction: the two source register numbers are n and m, the operation is (+), and the result goes into register n. Note that the ^ operator means "get the contents of". The assembles clause shows the standard assembly code that is supposed to be generated: the first two assembly instructions load the PPE general-purpose registers r3 and r4 with the virtual register numbers n and m, and the third is a branch instruction that leads to a call to the PPE function AddVec.

The other example of a binary operation differs from the previous example in the type of argument to be passed. The following ILCG code shows how the replicate floating-point instruction REPFLT is implemented:

instruction pattern REPFLT(nreg n, sfreg r)
    means[ n := rep(^(r),VECLEN)]
    assembles['li 3, ' n '\n fmr f1,' r '\n bl RepVec'];

This instruction does not operate on two vector registers like the ADDFLT operation; instead it replicates a floating-point scalar into a vector register. The implementation shows that the type of the first argument is the same as in the previous examples, but the type of the second argument is different: it refers to a signed floating-point machine register (sfreg), which is supposed to hold the scalar value. Consequently, there is a need here for an assembly instruction such as fmr to move the scalar value into the first floating-point machine register (f1) in order to pass the floating-point value.

• Instructions of Unary Operation

We present here the implementation of a unary operation, shown in the following ILCG code:


instruction pattern SQRTFLT(nreg n)
    means [n := SQRT(^(n))]
    assembles['li 3,' n '\n bl SqrtVec'];

The first line defines the instruction SQRTFLT, which takes one parameter, and the second line specifies the semantics of the instruction, in which the source register n is also used as the destination to hold the result of the square root operation. The assembles clause includes two standard assembly instructions: the first ('li 3,' n) loads the PPE general-purpose register 3 with the virtual register number n, while the second (bl SqrtVec) is a branch instruction which leads to a call to the PPE stub routine SqrtVec.

6.7 Building the VP-Cell Compiler System

This is the third stage in the development of the VP-Cell compiler system. It involved the integration of the VSM interface with the VP compiler as a single compiler system that has the capability to automatically parallelise and execute array operations on the SPEs. This last phase included extending the PowerPC back-end compiler to incorporate a user-level instruction set (Virtual SIMD Instructions) expressed using a standard instruction set. The extension included a definition of a set of 8 virtual SIMD registers, or vector registers, and a description of the VSM instructions. The ILCG file which contains the description of these resources was appended to an existing ILCG description of the PPE to give a complete description of an integrated machine. The ILCG compiler then builds a code generator in Java which is compiled and linked with the Pascal compiler. The extension also included a slight change in the code generator routine which determines the degree of parallelism the target machine can support on different data types. It is important to note that no changes were necessary in the parser or the high-level optimiser of the compiler itself in order to achieve this. By supplying the machine code generator with the size of the available virtual or vector registers, the parallelisation process is simply piggybacked on the way the compiler front end decomposes array operations for whatever processor it is targeting. The code generator 'thinks' it is generating a set of RISC-like register load, operate and store instructions. These actually expand out into calls to the PPE stub routines, which pass the instruction arguments as messages to each active SPE using its mailbox registers.

The process of building the code generator and compiling a Pascal program is shown in Figure 6.1.


Figure 6.1: Building the code generator and compiling a Pascal program.

We illustrate how VP-Cell handles the execution of array expressions with the following simple Vector Pascal program which reads two arrays of reals from standard input and prints the average of the two arrays given that the VSM register size is 8192 bytes.

01 program Av(input,output);
02 const arraylen=4096;
03 type vector=array[1..arraylen] of real;
04 var a,b,average:vector;
05 begin
06   readln(a);readln(b);
07   average:= (a+b)/2;
08   writeln(average);
09 end.

When the compiler processes line 7, it sees that it has to perform calculations on vectors 4096 elements long, so it asks the code generator for the maximum degree of parallelism this machine can offer on reals. On a simple scalar CPU like the old Pentium, the answer would be 1, and the compiler would generate a loop going round 4096 times to do the calculation.

For the VSM, though, the answer on the degree of parallelism comes back as 2048, so the compiler decides that it can unroll the entire task into two successive calculations on sub-arrays of length 2048. It thus schedules loads of the two halves of a and b into a pair of NV registers, followed by add and divide instructions working on the entire vector registers.

The program takes the form of a mother process on the PPE and 1, 2 or 4 daughter processes on the SPEs. At the start of the program, one or more copies of a VSM interpreter program are initiated, one per SPE. A compile-time flag can be used to select how many SPEs are to be used.

If a single SPE is used, the entire length of each NV register, which is in this example 2048 real elements, is mapped into the local memory of that SPE. The interpreter sits in a loop reading instructions from the mailbox and dispatching interpreter tasks corresponding to the instructions. If two or four SPEs are used, each SPE operates on 1024 or 512 elements respectively. In this case, the PPE stub routines broadcast the VSM instruction to each active SPE. Because each SPE can now process its portion of the VSM register in parallel with the others, adding more SPEs accelerates the execution process.

6.8 Coding

The following points summarise what has been implemented and how many lines of code were produced:

• The PowerPC machine description file, PPC.ilcg, consists of more than 1250 lines of ILCG code, written to define around 209 abstract instructions and to map these instructions onto PowerPC assembly instructions. This file is then fed to the code-generator generator, which automatically generates Java methods for each described operation. The generated methods are kept in a PPC.java file that contains more than 47000 lines of automatically generated code.

• The machine-dependent routines are implemented in Java as an extension to the automatically generated program. The program extension is called PPCCG.java, and it handles stack operations and the function-calling mechanism. This program contains around 750 lines of code.

• The macros were written in Gas assembly language. The macros file contains about 230 lines of assembly instructions.

7 Code Generator Optimiser

There is a variety of problems which could be optimised automatically in the compiler domain, but the intention here is to focus on using genetic algorithms for optimising code generators.

Code generators of some programming languages, such as Prolog and VP, employ a unification-based technique for matching or constructing logical proofs [42, 159]. For example, VP code generators employ a pattern-unification matching technique to map semantics into assembly code. Adjusting code generators manually usually depends on feedback procedures to enhance the ordering of instructions, but this tuning process is very time consuming. The manual approach also requires considerable human effort and does not guarantee that a given sequence of instructions is a good choice. Thus, it would be productive to have a tool that can assist compiler designers, especially for newly developed compilers, in tuning and enhancing code generators, and that consequently reduces execution time for a given application. In this project, I developed a new approach to optimising a compiler code generator; the novelty of this approach is the use of Genetic Algorithm (GA) techniques to automatically optimise machine instruction ordering.

This chapter discusses the code generator optimiser in detail. It starts with an introduction to the generated-code problems which could be optimised using GAs. It then looks at two existing approaches in which GAs were used to optimise code. After that, it gives a brief description of the basic algorithm which was developed to optimise compiler code generators and looks at the design and implementation aspects. It concludes with experimental evidence showing that the use of such genetic algorithms can improve the quality of automatically constructed code generators.

7.1 Introduction

Some compilers use unification to match generated-tree nodes with the semantics of the target machine and generate the assembly instruction(s) associated with the first matched pattern. This means that a single node may match a number of semantics, and hence the order of the instruction patterns can affect which of several possible matches will succeed.

Using unification algorithms raises the problem of choosing the optimal instruction order from a vast range of semantically equivalent code sequences.

Thus, this work investigated whether genetic algorithm techniques can help in finding a better instruction order that consequently reduces execution time. This piece of work was carried out in the early stages of the PowerPC back-end compiler development, in an attempt to improve its code generator, because it was newly developed and had not gone through many hand-ordering optimisations. Indeed, during the verification and testing of the compiler implementation, we noticed on different occasions that the number of instructions generated to achieve a certain semantic effect could be reduced just by reordering the machine instruction set.

However, this approach differs from two earlier approaches, which shall be introduced shortly, in the code it targets. The previous two approaches optimise generated code, one program at a time, whereas our optimiser targets code generators. By optimising the code generator itself, the optimiser is run only during compiler development, and the speedups are obtained in the many applications subsequently translated by the optimised compiler. Reducing the generated code is expected to result in less execution time, yet this is hard to prove due to the lack of reliable instruction timings.

7.2 Previous Work

GAs can also be used to optimise generated code. The size of generated code for embedded systems is a critical issue, and software developers often tolerate much longer compile times in the hope of reducing the size of the generated code. Two approaches have previously been proposed to use GAs for optimising generated code.

The first approach used GAs to optimise the restructuring of FORTRAN programs for SPMD execution on parallel machines, and it was called the Genetic Algorithm Parallelisation System (GAPS) [160]. This technique was used for optimising the global overhead of parallelising loop-based FORTRAN applications. This form of parallelisation usually requires restructuring and transformation of the original code, such as loop fission and loop fusion, and gives rise to a huge number of possible transformations. The GAPS technique was an alternative to the conventional transformation process, which is basically based on mapping each statement in the source code into a sequence of alterations. GAPS was also proposed as an enhancement of other existing approaches which attempt to optimise individual statement overheads but not the global overhead of transforming an application or a given program [160].


The second proposal used GAs for optimising the dual instruction set of the ARM processor. The ARM processor is heavily used in embedded machines. In addition to its standard RISC instruction set, ARM supports a reduced instruction set, called Thumb [161]. Thumb instructions are shorter than those of the original instruction set. A program compiled using only Thumb instructions uses more instructions than the same program compiled using the standard instruction set, and it is consequently slower [161]. Because the dual instruction sets can affect the efficiency of compiled programs in terms of both performance and space, this approach helps a code generator to switch between the two instruction sets in order to optimise a program's execution time and its code size.

Genetic algorithms can, however, also be used to optimise the construction of code generators that are based on unification algorithms. Unification is the process of unifying different representations in an attempt to resolve a satisfiability problem. Different programming languages use unification algorithms for different purposes. For example, the Haskell programming language uses unification for type inference, while Prolog and Vector Pascal depend on unification algorithms for pattern matching.

7.3 Permutation Technique

A permutation is a rearrangement of a given finite set of objects or values, and permutations can be a useful encoding technique for some ordering problems [130, 162]. The most common example of a permutation problem is the travelling salesman problem. A solution of the classic travelling salesman problem is basically a list of cities in which each city must occur once and only once. A simple way to represent a solution over C cities is a list in which each city is given a unique integer number, say 0..C−1. Accordingly, any proposal (solution), or tour of the cities, must include all of these integers, each occurring only once, in no particular order. From the salesperson's point of view, the best solution (order) is the shortest path that visits all the cities, but the search space is too large to span exhaustively even with a small number of cities. This representation raises the problem of finding the permutations of these integers (cities) that deliver optimal solutions for the salesperson. For this reason, there have been past studies on how to use genetic algorithms to encode and solve travelling salesman problems [162].

The problem of instruction ordering in compilers is also a permutation problem, like the classic travelling salesman problem. In compiler development, the most important part of code generator design is the instruction set, which normally goes through successive enhancements and modifications to improve the quality of the constructed code generator. This lineage of revisions can be viewed as an inherited genome, and therefore genetic algorithm techniques are a promising code generation strategy that can help in finding instruction schedules that deliver near-optimal performance for a particular machine.


instruction pattern IFI(lbl l, reg r1, immupto16 s, cond c, int b)
    means [if((b)c((t) ^(r1),const s))goto l]
    assembles ['cmpwi ' r1 ', ' s '\n b'c ' ' l];

instruction pattern IF(cond c, reg r1, reg r2, reg r, lbl l)
    means [if ((int32) c((int32)^(r1), (int32)^(r2)) ) goto l]
    assembles ['cmpw ' r1 ',' r2 '\n b'c ' ' l];

instruction pattern LIMM(imm i, reg r)
    means [r:=const i]
    assembles ['li ' r ',' i ];

Figure 7.1: Branch Instructions Patterns in ILCG Using Gas Assembly

Permutation techniques can be useful in constructing compiler code generators, especially those employing unification algorithms for matching machine semantics (patterns) with intermediate forms of source programs. Where multiple alternative patterns are possible, unification algorithms basically output the first pattern whose matching succeeds. As a result, the efficiency of the matching output depends partly on the order of the instruction patterns.
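The order-sensitivity of such a first-match scheme can be pictured with the following toy Java matcher. The pattern names echo the ILCG patterns in Figure 7.1, but the matching predicates and the string encoding of a tree node are purely illustrative assumptions, not the real unification algorithm.

import java.util.List;
import java.util.function.Predicate;

public class FirstMatchSketch {
    record Pattern(String name, Predicate<String> matches) {}

    /** Return the name of the first pattern that matches the node,
        as a unification-based code generator would. */
    static String select(List<Pattern> ordered, String node) {
        for (Pattern p : ordered) {
            if (p.matches().test(node)) return p.name();
        }
        return "no match";
    }

    public static void main(String[] args) {
        Pattern IF  = new Pattern("IF",  n -> n.startsWith("if"));        // register-register compare
        Pattern IFI = new Pattern("IFI", n -> n.startsWith("if-const"));  // register-immediate compare
        String node = "if-const r4,0";                                    // toy encoding of a tree node
        System.out.println(select(List.of(IF, IFI), node));  // prints IF  (two instructions needed)
        System.out.println(select(List.of(IFI, IF), node));  // prints IFI (one instruction suffices)
    }
}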

7.4 Why Instructions Ordering is a Problem

Instruction ordering in unification-based compilers is a problem that has two faces: productivity, or performance, and efficiency. Compiler code generators, such as VP's, employ unification algorithms to match the intermediate representation of source code with machine semantics. They are basically supplied with a machine instruction set as a list which contains all the machine patterns in some given order. This list is used to match against the intermediate (ILCG) code of source programs.

VP compilers use unification to match machine semantics with the abstract syntax tree of the translated code and consequently generate assembly code. However, if multiple alternative patterns are possible, the matching process raises the problem of choosing the optimal instruction order, which can be viewed as one face of the instruction ordering problem. This optimisation process is often based on manual feedback procedures to enhance the machine code generator, yet manual tuning often requires considerable human effort and does not guarantee that a given sequence of instructions is a good choice.


li r5,0
cmpw r4,r5

Figure 7.2: Assembly Code


The order of the instructions in the instruction set (list) is a crucial issue, as there may be several ways to match generated-tree nodes against a given intermediate code. To illustrate this, let us take a simple example of array bound checking. First assume that the target machine description includes the instruction patterns IFI, IF and LIMM, shown in Figure 7.1, as representations of the assembly cmpwi and cmpw instructions.

The IFI instruction in Figure 7.1 basically compares a register with a 16-bit immediate value, while the IF instruction uses two registers to carry out the comparison. The LIMM instruction represents the assembly instruction (li) that loads a 16-bit value into a register. Though these three instructions could be listed in n! orders, we consider only two different orders that could lead the compiler to generate different instructions. In these two orders we swap only the instructions IFI and IF, as this is enough to illustrate how the instruction order may affect performance. Now, to check the lower bound of an array, say 0, given these instructions, a compiler code generator has at least two alternatives for performing the check. The first option arises if the instructions are listed in the instruction set (PPC-Instr) in the following order:

define (PPC-Instr,...|...|IF|LIMM|IFI|...|...)

The code generator will then most likely use the register-to-register compare instruction because it sees the instruction IF before the instruction IFI. Thus, the compiler will attempt to use the IF (cmpw) instruction if it can place the constant value (0) into a register. This condition can easily be fulfilled by the LIMM instruction, which loads the immediate value (0) into a register. Under these circumstances the code generator will carry out the comparison using two assembly instructions, as shown in Figure 7.2. The generated code assumes that register r4 holds the array subscript and register r5 holds the lower bound.

The alternative is to order these three instructions as follows:


cmpwi r4,0

Figure 7.3: Optimal Assembly Code

define (PPC-Instr,...|...|IFI|LIMM|IF|...|...)

The code generator in this case sees the IFI instruction before the LIMM and IF instructions, and therefore it will choose the IFI instruction pattern, which results in carrying out the comparison using only one assembly instruction; see Figure 7.3. This generated instruction does exactly the same task as the two instructions given in Figure 7.2.

Note that, as a result of the second instruction order, the number of assembly instructions generated for checking the array lower bound was reduced from two to only one. This may be an optimal order in terms of the number of instructions being halved, yet it does not mean that the total latency is also reduced by the same factor, because this depends mainly on the cost of the eliminated instructions (the alternative solution) as compared to the remaining instructions. Thus, one can infer here that the order can affect the code generation process and program execution time.

The size of the search space to be explored, and the efficiency of finding an appropriate solution, can be seen as the other face of the problem. Though choosing a proper ordering to improve performance can be done manually, the manual approach is less efficient than automatic techniques because the size of the search space to be spanned is very large, and thus there may be better solutions than the hand-ordered one. Indeed, finding near-optimal or better solutions is a challenging process even at the first stages of compiler development, due to the problem size. A machine with N instructions means there are N! possible orders in which the instructions could be listed. For example, on a machine with only 100 instruction patterns, a compiler developer for languages like Prolog and VP is confronted with a massive search space of around 10^157 combinations of instruction orders, and this number grows very rapidly with even a few additional instructions.

In view of these considerations, we developed an optimiser using genetic algorithm techniques to optimise machine instruction ordering automatically, in the hope of helping to construct better compiler code generators.

7.5 Genetic Algorithm Approach to the Problem

The instruction ordering problem is a permutation problem similar to the travelling salesman problem, as the solutions in both cases can be represented as lists of objects. Thus, we present here a genetic algorithm that is based on a new permutation technique for solving the instruction ordering problem. Our genetic algorithm basically starts with a genetic pool of different instruction pattern (opcode) orders that potentially includes one solution as a default order; for an existing code generator this could be the existing hand-ordered version. It then evaluates these solutions and selects the fitter ones to be used for reproducing offspring. After that, we perform genetic recombination between pairs of the fittest solutions, and re-evaluate the new solutions in order to reselect and reproduce the offspring of the next generations. Algorithm 7.1 demonstrates these main steps in solving the instruction ordering problem using genetic algorithms.

Algorithm 7.1 Basic Genetic Algorithm for Instruction Ordering

// Create Initial Population
{ Get default machine instruction order (first genome); }
{ Generate random populations (excluding first & last genomes); }
{ Reverse the default instruction order (last genome); }

// Evaluate the Fitness of the First Population
{ Fitness function; }

// Generate and Evaluate New Generations
While ( Number of Generations not Satisfied ) {
    // Select Parents for Reproduction
    { Select fitter solutions; }

    // Generate Offspring Using Selected Pairs
    { Perform a single-point crossover; }
    { Perform a flip-bit mutation; }

    // Evaluate the Fitness of Successive Populations
    { Fitness function; }
}

7.6 Key Design Aspects

7.6.1 Environmental Variables

In order to start the optimiser, users are required to provide some information or parameters. Some of these parameters have default values, yet some need to be supplied by the user. These parameters should be prefixed with the symbol "-" and the letter(s) denoting the individual parameters, as shown below.

1. Number of populations

2. Number of solutions in a population

3. Number of offspring

4. Names of training programs

5. Targeted machine’s name.

7.6.2 Representing Solutions

There are three levels of representations:

1. A solution is a list of N instruction patterns (opcodes).

2. A permutation list L that is made of integers in the range 0 .. N − 1 in which each number must occur once.

3. Genomes are encoded as binary bitstrings, each of length N − 1. A genome G is a set of parameters that defines a proposed solution to the problem, but it does not encode the permutation directly. In other words, a genome is like a program for a permutation machine.
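To make these three levels concrete, a minimal C++ sketch of the corresponding data types is given below. The type names are illustrative assumptions only and do not reproduce the optimiser's actual declarations.

#include <string>
#include <vector>

// Level 1: a solution is an ordered list of N instruction patterns (opcodes).
using Solution = std::vector<std::string>;

// Level 2: a permutation list of the integers 0..N-1, each occurring exactly once.
using Permutation = std::vector<int>;

// Level 3: a genome is a bitstring of length N-1; it acts as a "program" for the
// permutation machine rather than encoding the permutation directly.
using Genome = std::vector<bool>;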

7.6.3 Encoding First Population

The first step that the optimiser performs after being supplied with the parameters is to generate an initial population (j = 1) of genomes G1..GS. This step is carried out only once, to generate the first population, and it involves:


1. Setting all the bits of the first genome G1 to 0 (Null permutation). The permuted list that is generated based on G1 is supposed to be identical to the default opcodes order.

2. Setting the bits of S − 2 genomes randomly (Random permutations).

3. Setting all the bits of GS to 1 (Full permutation). This last bitstring (GS) should result in generating a reversed order of the default opcodes (G1).

In this aspect the encoding has some similarity to structured genetic algorithms for function optimisation [163].
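A minimal C++ sketch of this initialisation step is shown below. It assumes S is the population size and N the number of opcodes; the function name and the use of std::rand are illustrative assumptions, not the optimiser's actual code.

#include <cstdlib>
#include <vector>

using Genome = std::vector<bool>;

// Build the first population of S genomes over bitstrings of length N-1:
// G1 = all zeros (null permutation, default opcode order),
// G2..G(S-1) = random bitstrings, GS = all ones (full permutation).
std::vector<Genome> createInitialPopulation(int S, int N) {
    std::vector<Genome> pop(S, Genome(N - 1, false));   // G1 stays all zeros
    for (int i = 1; i < S - 1; ++i)                     // random middle genomes
        for (int b = 0; b < N - 1; ++b)
            pop[i][b] = (std::rand() % 2) != 0;
    pop[S - 1].assign(N - 1, true);                     // GS = all ones
    return pop;
}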

7.6.4 Main Optimiser Operations

This is the most important phase in the application of genetic algorithms because it involves three key steps: performing mutation and crossover operations, evaluating solutions via a fitness function and producing new generations.

7.6.4.1 Crossover Operation

The crossover operation is part of the reproduction process. This operation is carried out on pairs of genomes (parents) to combine the parents' genes once the parents of a given generation have been selected. A number of techniques have been used to perform the operation, but the most common is the crossing-over technique. The crossover technique can be applied using different approaches, such as single-point, two-point, and cut-and-splice. Our algorithm is based on the one-point crossover approach, which picks a random point on both parent genome bitstrings and then swaps the genes of the parents beyond that point to produce new children, as shown in Figure 7.4. The optimiser applies the one-point crossover technique to pairs of parent genomes from the selected S − C genomes.
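The following is a minimal C++ sketch of this one-point crossover step, matching the example in Figure 7.4. The function name and the use of std::rand are assumptions for illustration only; it assumes both parents have the same length of at least two bits.

#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

using Genome = std::vector<bool>;

// Single-point crossover: pick a random cut point and swap the tails of the two
// parent bitstrings, producing two children (cf. Figure 7.4).
std::pair<Genome, Genome> onePointCrossover(const Genome& p1, const Genome& p2) {
    Genome c1 = p1, c2 = p2;
    std::size_t cut = 1 + std::rand() % (p1.size() - 1);   // cut point in 1..size-1
    for (std::size_t i = cut; i < p1.size(); ++i) {
        c1[i] = p2[i];      // child 1: head of p1, tail of p2
        c2[i] = p1[i];      // child 2: head of p2, tail of p1
    }
    return {c1, c2};
}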

7.6.4.2 Mutation Operation

Mutation is a genetic operator that can be used to diversify genome representations from one generation to the next. This operation alters a few gene values (bits) in a genome from its original state. We use a bit-string mutation technique which simply flips bits at m random positions, and thus the mutation probability for a genome of length N is m/N. However, the number of bits (m) to be changed, or the mutation probability, should be set low to ensure that the search does not become a typical random search [128]. Our optimiser uses the flip-bit mutation operator to invert 2 arbitrary bits of the new child genome.
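A corresponding C++ sketch of the flip-bit mutation is shown below; again the function name is an assumption for illustration, and the randomly drawn positions may occasionally coincide.

#include <cstddef>
#include <cstdlib>
#include <vector>

using Genome = std::vector<bool>;

// Flip-bit mutation: invert m randomly chosen bits of the child genome
// (the optimiser uses m = 2, giving a mutation probability of about m/N).
void flipBitMutation(Genome& g, int m = 2) {
    for (int i = 0; i < m; ++i) {
        std::size_t pos = std::rand() % g.size();
        g[pos] = !g[pos];       // invert the selected bit
    }
}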


Parents:   1 0 1 0 1 1 0 0 0 1        1 1 0 0 0 1 0 1 1 1
Children:  1 0 1 0 0 1 0 1 1 1        1 1 0 0 1 1 0 0 0 1

Figure 7.4: One-point Crossover

7.6.4.3 Permutation Operation

Using the representations of genomes and permutation lists described above in the usual way gives rise to problems with the mutation and crossover operators, because the result of applying mutation or crossover to a given list is no longer necessarily a permutation. Our permutation machine design is therefore based on a swapping technique which ensures that both binary crossover and mutation operations always lead to a valid permutation.

The permutation machine F reads in the bitstring G and an initial valid permutation L, carries out a sequence of valid permutations on L and outputs another valid permutation Q. That is,

Q = F(G, L)

Now if the permutation machine reads the identity permutation I = 0, 1, 2, 3, ..., N − 1, then each permutation program Gj labels a permutation Lj produced by the application Lj = F(Gj, I).

To illustrate how the permutation machine F works, let the permutation machine code be as follows:

G = b00 b10 b11 b20 b21 b22 b23 b30 b31 b32 b33 b34 b35 b36 b37 b40 ...

The permutation machine F then proceeds according to the following rules:

• If b00 is set, swap the 1st half of L with the 2nd half of L

• If b10 is set, swap the 1st quarter of L with the 2nd quarter of L

• If b11 is set, swap the 3rd quarter of L with the 4th quarter of L

• If b20 is set, swap the 1st eighth of L with the 2nd eighth of L

• ... etc.

Since the permutation of a permutation is still a permutation, and since swapping is a per- mutation operator, it is clear that the machine F will always produce a valid permutation of L if L is itself a valid permutation. Note that the first bit has more effect than the second and third, and that these have more effect than succeeding ones. In this aspect the encoding has some similarity to structured genetic algorithms [163].

It is also evident that the space that can be searched using this representation is of the order of 2^n, which is less than n! for all n > 3, so the space searched by the GA based on this genome will only be a fraction of the possible permutation space. Since n! is bounded above by 2^(n log2 n), we could construct a permutation programme of length n log2 n bits that would be capable of producing any permutation. How can we do this using our existing permutation function F?

We need to apply log n independent permutations to the genome in sequence, and in order to do that we need a new function g(i, s, L) which, given an integer i : 0..(log2 n) − 1, a bitstring s as before and a permutation L, generates the permutation rot(F(s, L), 2^i); that is to say, it applies the permutation programme s to L as before and then cyclically rotates the permutation list by 2^i places. Suppose we have a genome of length n log n with log n bitstrings, each of length n. We denote the i-th of these component bitstrings as s_i. The complete permutation space can then be scanned by composing g with itself log n times as follows:

for i := 0 to (log2 n) − 1 do L := g(i, s_i, L)

However, for realistic numbers of opcodes (of the order of 100..200), 2^n already represents a huge optimisation search space and gives the GA plenty of scope to find improvements, so our initial experiments used a genome of length 2^⌈log2 n⌉.

7.6.5 Generations Production

The process of generating a successive population (generation) j includes selecting fitter solutions from population j − 1 using a fitness function and then breeding new offspring using crossover and mutation operations on selected solutions.


7.6.5.1 Fitness Function

Typically, a fitness function is used to evaluate the efficiency of a given design solution. Our optimiser's fitness function involves a complicated procedure and takes considerable time to evaluate each solution. It is a multiple-step process: it involves building a code generator for each generated solution, one at a time, and then compiling, running and measuring the performance of each training program. The following explanation describes these steps:

1. Produce a New Solution

Given a permutation list L, produce a new solution that is supposed to have a different instruction (opcode) order, and use that order to define the instruction set for the target machine.

2. Rebuild Compiler

Use individual solutions in a given population to reconstruct new code generators.

3. Compile Test Programs

If the code generator for that particular solution was successfully built, then compile the training programs.

4. Execute Test Programs

If a given training program was successfully compiled without any errors, then execute the training program.

The fitness function is designed to report any failure occurring during this process, using different flags to distinguish each step. The flag is set to a negative value if any of the above steps fails. On the other hand, if the rebuilding, compilation and execution processes were completed without errors or any other interruption, the function then

• Measures the execution time of every training program under each solution and reports the timings in files for further use.

• Gets the average execution time of all training programs under each solution in a population.

• Sorts the solutions in a given population in descending order of fitness, based on the average execution time of the training programs (a sketch of this scoring step is given below).
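The following C++ fragment sketches this final scoring step. The structure and function names are assumptions for illustration only, with the fitness value taken as completion success divided by the average run time (as defined in Section 7.7.1.2).

#include <algorithm>
#include <vector>

// One candidate solution together with its measured results;
// flag < 0 marks a failure while rebuilding, compiling or running.
struct Candidate {
    std::vector<double> runTimes;   // execution time of each training program
    int flag = 1;
    double fitness = 0.0;
};

// Compute fitness = C / T (C = 1 on success, 0 otherwise; T = average run time)
// and sort the population so that the fittest solutions come first.
void scorePopulation(std::vector<Candidate>& pop) {
    for (Candidate& c : pop) {
        double total = 0.0;
        for (double t : c.runTimes) total += t;
        double T = c.runTimes.empty() ? 0.0 : total / c.runTimes.size();
        double C = (c.flag > 0) ? 1.0 : 0.0;
        c.fitness = (T > 0.0) ? C / T : 0.0;
    }
    std::sort(pop.begin(), pop.end(),
              [](const Candidate& a, const Candidate& b) { return a.fitness > b.fitness; });
}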


7.6.5.2 New Generations

The production process of new generations starts from the point at which the fitness function stopped and goes through the following two steps:

• Selecting fitter solutions (Parents)

Given that the solutions in a population, say j, were already sorted in descending order, the optimiser will pick the fitter solutions as the parents of generation j + 1. The number of offspring is set by default to 1/3 of the total number of solutions (a population), yet the optimiser allows users to determine the number of offspring to be bred. Using the default number of offspring, the number of the fittest solutions to be selected as parents of the new generation (j + 1) is 2/3 of population j.

• Generating offspring (Children)

In this step, the optimiser will breed the remaining genomes. Breeding a child genome can be divided into several steps.

– The first step is to select a parent for the new child. Our optimiser is designed to choose the parents randomly, given that each parent must have only one child; that is, a parent is used once and only once in the reproduction process.

– The second step is to use a single-point crossover technique, as described above, to vary the genome of the new child. The crossover point is also selected randomly.

– The third and last step is to apply a mutation operator to preserve some genetic diversity between different generations. The optimiser inverts only 2 arbitrary bits of the new child genome.

7.7 Implementation

The optimiser was implemented in C++. The current version of the optimiser contains several C++ functions and uses a script program and a makefile utility for building compiler code generators.

We start this section by introducing the key functions of the optimiser and then giving a brief introduction to the script program, which was coded in the AWK scripting language. We do not intend here to present the entire code of these functions; rather, we give samples of the code that are essential to the discussion.

7.7.1 C++ Program

Most of the optimiser implementation was written in C++. The C++ program's responsibility starts with extracting the arguments (environment variables) passed from the command line, continues with carrying out all the genetic operations, and ends with documenting all the information used and obtained during the optimisation process.

7.7.1.1 Launching the Optimiser

To start the optimiser, users must provide the required information, especially the parameters that do not have default values, such as the names of the training programs. The following command line shows how to call the program and what information is expected:

GA.exe -Tx -Gx -Sx -progTest1.pas -progTest2.pas -cpuPPC

where x must be an unsigned integer value. The symbols are interpreted as follows:

• T: Prefixes a test trial's number, which is used to create a directory as a work space for that test.

• G: Prefixes the number of populations; accordingly, one subdirectory for each population will be created in the directory that was created in the previous step.

• S: Specifies the number of solutions in a population, and thus S files will be created to maintain all information on the different solutions in that population.

• prog: This symbol should prefix each training program name. There is no limit on the number of training programs that can be used in the fitness function.

• cpu: The last argument determines the target processor.
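For instance, with the settings later used in the experiments of Section 7.8.2 (6 generations, 90 solutions per population, three training programs, PowerPC target), a hypothetical invocation might look like:

GA.exe -T1 -G6 -S90 -progSieve.pas -prognBody.pas -progVectest.pas -cpuPPC

where the trial number 1 is chosen arbitrarily here for illustration.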


7.7.1.2 Key C++ functions

• Set Up:

This is the first function to be called. Its task is to extract the data items passed via the command line and, based on these extracted parameters, to set up a working place as shown in the code fragment below.

char newDir[50];

// Create a trial directory for this test if it does not already exist
sprintf(newDir, "Test%d", T);
if (opendir(newDir) == 0) {
    sprintf(newDir, "mkdir Test%d", T);
    system(newDir);

    // Create a subdirectory for each generation
    for (int i = 0; i < NoPop; i++) {
        sprintf(newDir, "Test%d/GEN%d", T, i);
        if (opendir(newDir) == 0) {
            sprintf(newDir, "mkdir Test%d/GEN%d", T, i);
            system(newDir);
        }
    }
} else
    exit(1);    // abort if the trial directory already exists

• Machine Instructions Opcodes

Its role is to get the identity permutation list of the target machine opcodes and count the number of opcodes.

• Create first population

The function is called only once to generate the initial population.

• Fitness function

The fitness function uses negative values as flags (flg) to signal failures of different processes. In the current implementation, for example,

-1 Indicates that the machine description file was not successfully modified.


-2 Hints at the failure of building the compiler code generator.

-3 Implies that the compilation of a given program failed.

-4 Points out that the execution of a training program failed.

The following formulas determine the efficiency or fitness value of each design solution and the number of each activity or task that the fitness function performs:

– The fitness of a solution is computed as

Fitness_value = C / T

where T is the average execution time of all training programs and C represents completion success; i.e., C = 1 if flg > 0, otherwise C = 0.

– The number of reconstructed code generators

NCC = G + (G − 1) ∗ (S − O)

where O represents the number of children in generations 1..G, while S, as defined above, represents the number of solutions and G the number of populations.

– The number of compilations that the optimiser is expected to do during the whole process is

NOC = NCC ∗ noPrograms

where noPrograms is the number of training programs.

– New Generation production

This function triggers two other functions. First, it calls a function that chooses the best solutions as the parents of a new generation, provided that the solutions are already in descending order. It then calls a second function to generate offspring; this second function chooses a parent randomly and then performs crossover and mutation operations to breed a new child, making a slight change to the child's genes.

– Rebuild a new code generator

This function goes over every design solution in each population, and for each solution it issues the following command line to invoke the AWK program.


gawk -v T=TestNo -v G=Gen -v S=SolNo -f GA.awk

This command launches a program called GA.awk and passes the values TestNo, Gen and SolNo to the AWK variables T, G and S respectively.

• Documentation

The optimiser was also designed to report the status of almost every process and most of the obtained results in text files. The documented information includes:

– Generated genomes in every generation

– The exit status of modifying the machine specification files, building the code generators, compiling each training program.

– The execution time of each program under each individual solution.

– Average fitness value of each solution in a population.

7.7.2 AWK Program

The AWK language is a data extraction tool for processing text files. The language treats a file as a series of records, split by newline characters, and each record as a series of fields [164]. The GA.awk program is mainly designed to reconstruct the code generator of an individual solution; the values passed to it determine the working space. The AWK program performs the following tasks:

• Read machine opcodes from a file and then group them in a list

• Modify the machine description file and report the process status.

• Invoke a makefile for building the compiler code generator

• Report if the compiler building process succeeded or failed.

7.8 Experimental Results

For comparing the efficiency of the optimiser on different machines, we ran it on the PowerPC code generator, which is a subset generator for the Cell Broadband Engine, and on the IA32 architecture with SSE2 instructions. The VP front-end compiler was the same for both architectures, and the performance of each design solution was tested using the same benchmarks on both machines.


7.8.1 Benchmarks

We tested the optimiser using two sets of benchmarks: one set contains three benchmarks used in the fitness function to evaluate the algorithm, and the second set consists of two benchmarks used to test whether the optimised design solution, which the optimiser produced, yields the same performance improvement on these two programs. The benchmarks of the fitness function are: n-body, prime sieve and a validation program that was developed for testing a range of VP array operations.

• N-Body Problem

The n-body problem is a simulation of particles moving under the influence of gravity. The program starts with an initial position, mass and velocity of a group of particles at a given time. It then uses that data to work out the motions of all particles and to calculate their positions at later times. The calculation is based on the laws of motion and gravitation. The program was taken from the Programming Languages Shootout web-site (http://shootout.alioth.debian.org/).

• Sieve Program

The sieve program finds all prime numbers that are less than or equal to a given integer value "n" using Eratosthenes' method. The algorithm works by first creating a list of the integers > 1 and <= n, and then repeatedly removing composite numbers until all composite values have been eliminated and only prime numbers remain. A minimal sketch of the method is given after this list.

• Vector Operations

The last application is a special-purpose program that was developed to test the Vector Pascal compilers on various array operations, such as transpose, reduction and dot product operations, on different data types such as single-precision floating-point, single-precision fixed-point and characters.
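As an illustration of the sieve method mentioned above (not the benchmark's Vector Pascal source), a minimal C++ version might look as follows:

#include <iostream>
#include <vector>

// Sieve of Eratosthenes: repeatedly mark composites, keep the primes <= n.
std::vector<int> sieve(int n) {
    std::vector<bool> composite(n + 1, false);
    std::vector<int> primes;
    for (int i = 2; i <= n; ++i) {
        if (!composite[i]) {
            primes.push_back(i);
            for (long long j = (long long)i * i; j <= n; j += i)
                composite[j] = true;        // remove multiples of i
        }
    }
    return primes;
}

int main() {
    for (int p : sieve(50)) std::cout << p << ' ';   // primes up to 50
    std::cout << '\n';
}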

We selected benchmarks that use different data types. In the n-body program, most operations are performed on single-precision floating-point values, while the prime sieve program mainly uses single-precision fixed-point values. The validation program, however, was designed to test VP array operations on various data types. This mixture of data types provides some uniformity in using the average fitness value as a means of evaluating the performance of a design solution.

The other applications, which were used to test the optimised design solution, are Mandelbrot and Spectral-norm. They were also taken from the Shootout benchmarks. The first application is a mathematical set of points which usually requires a large number of computations, and it is commonly used to evaluate runtime performance. The second application is a program that calculates an eigenvalue using the power method.

7.8.2 Environment Variables

Here are the main parameters used in these experiments:

• The optimiser was set to run for 6 generations; i.e., -G6.

• The size of the working population was set to 90; i.e., -S90.

• The Vector Pascal training programs were Sieve, n-body and vecTest, and they were set as follows: -progSieve.pas -prognBody.pas -progVectest.pas

• The number of offspring was not specified, and therefore the optimiser used 1/3 of the S solutions as new children.

• The optimiser was set to carry out 2-bit mutation. Since the number of instructions on both machines ranges between 180 and 250, the mutation probability is therefore around 2/200 ≈ 0.01.

In these experiments, we ran the optimiser for only 6 generations, for two reasons. First, the fitness function sometimes took a long time, as it had to go through several steps for each test. Also, the time to complete the whole process varies from one design solution to another, because the algorithm sometimes either converges on an inappropriate solution or has difficulty converging at all. The second reason is that we did not seem to get any considerable improvement after the 4th generation, as we shall see shortly in the diagrams shown in the following section.

7.8.3 Results

The following results show the average and the best fitness values of each population on both machines. The fitness value is proportional to the performance of the training programs, and it is computed as c/t, where c = 1 if the programs all compiled and 0 otherwise, and t is the average run time of the programs. The higher the fitness score, the shorter the average execution time of the tested applications. The results also include a table showing the genetic algorithm improvements of several applications on the PowerPC machine.


Figure 7.5: The performance of a genetic algorithm on optimizing a PowerPC instruction set ordering.

• PowerPC Code Generator

We performed the experiments on the master (PowerPC) processor of the Cell Broadband Engine (Cell BE) found in the PlayStation 3. The PowerPC machine description includes 184 instructions. The code generators were generated using the different solutions (IS orderings) offered by different generations. Figure 7.5 shows the average fitness and the best fitness of the solutions in successive populations. The average fitness values reflect the performance of the code generators on three different applications; the higher the fitness score, the shorter the average execution time of the selected applications. The figure generally reveals that the performance of the PowerPC code generators improved significantly throughout the first three successive generations. The results also demonstrate that the solutions offered by the last generation led to a performance improvement by a factor of 3.4 as compared to the solutions provided by the first generation. The average fitness value of the solutions of the first generation was about 22.2, while it reached around 72.6 at the fourth generation, an improvement factor of about 3.2. However, in the last two generations the algorithm became steadier and only an insignificant improvement was gained.

Figure 7.5 also plots the best solution in each generation. The chart shows that the best solutions were produced in the first two generations, and that the second generation provided the best design solution among all generations.

The results presented so far showed that the optimised ordering provided a considerable improvement on the training programs, but the question is whether that improvement can be generalised to applications that were not in the training set.

li   r6, -24
lwax r5, r6, r4

Figure 7.6: Register-to-Register Operation

Table 7.1 shows the results of our investigation of whether the improvements gained as a result of using the optimiser's best-produced solution on the PowerPC machine can be generalised to applications other than the training programs. The table shows the genetic algorithm improvements on applications other than the training set. It compares the performance of five applications using the default instruction ordering and the best order (solution) that was generated by the optimiser. The first three applications in the table are the training programs, and the last two applications, Mandelbrot and Spectral-Norm, are arbitrary applications that were used to see if the best solution can lead to the same improvements gained on the training programs. The results shown in Table 7.1 reveal that the improvements on non-training programs are considerably close to those on the training programs. Yet, the improvement on the Mandelbrot application is not as close to those achieved on the training programs as that on Spectral-Norm.

The possible reasons for the performance degradation on Mandelbrot could be associated with loop and array addressing optimisations. The first reason in this regard is that loop executions often involve boundary checking and addressing, and these two operations usually require using offsets. The second point is that PowerPC machines offer only 16-bit immediate arithmetic and logical instructions, which are normally used in offset-based operations. The alternative on PowerPC architectures, however, for 16-bit offset operations is to use register-to-register instructions instead of register-immediate instructions. Figures 7.6 and 7.7 show a simple example of loading a 32-bit word, say a loop or an array lower bound, into register (r5) from an address relative (-24) to register (r4). The assembly code in both figures does the same task, yet the code in Figure 7.6 includes double the instructions shown in Figure 7.7. This simple example shows that the register-to-register approach requires more instructions than the other approach, and consequently it will most likely be more costly. Thus using an inefficient instruction ordering may result in generating code that gives correct results but is inadequate in terms of performance.


lwa r5, -24(r4)

Figure 7.7: Register-to-Immediate Operation

                                          Performance (Sec.)
Set                   Program              Default Order   Optimised Order   Improvement
Training Programs     Sieve                     26.5             20.6            22%
                      N-Body                    39.2             30.5            22%
                      Vector Operations         30.6             23.2            24%
Test Programs         Spectral-Norm             83.2             63.2            24%
                      Mandelbrot                24.7             21.1            15%

Table 7.1: Genetic Algorithm Improvements on the PowerPC. The first set of applications were training programs and the last two programs were not in the training set.

The third reason for the performance degradation on Mandelbrot is that in these experiments the Mandelbrot program was executed only a few times, while each of the other applications was executed thousands of times. The Mandelbrot program therefore does not involve as many loop iterations as the other applications, and it did not benefit from the optimisation of loop executions and array addressing as much as they did.

• Pentium Code Generator

The Pentium machine description includes more instructions than the PowerPC one; it consists of 252 instructions. The Pentium code generator, unlike the PowerPC code generator, had been under development for some years and had received considerable manual enhancement. Thus, the improvement in performance due to the optimiser was not expected to be as good as on the PowerPC. Figure 7.8 shows that the average fitness of the first three generations increased slightly. It then declined in the fourth generation, but was still better than in the first generation. After that, the average fitness started improving again, but by a very small margin. The diagram also shows that the best solution did not improve during the six generations. This could be due to the Pentium instruction set already being well tuned.

Figure 7.9 depicts the normalised performance gained by optimising the instruction orderings of both machines. The chart shows that the reordering of the PowerPC's instruction set improved the performance of the PowerPC by around 3 times, while on the Pentium the optimiser reports only a slight improvement.


Figure 7.8: The results gained by optimizing a previously manually optimised code generator for the Pentium using a genetic algorithm. The average fitness values reflect the performances of the code generators on the three selected applications.

Figure 7.9: Normalized performance gained by optimizing instruction set ordering of the PowerPC and Pentium 4 architectures using a genetic algorithm.

7.8.4 Conclusion

7.8.4 Conclusion

• During the experiments, it was noticed that the algorithm sometimes either converges on an inappropriate solution or has difficulty converging at all. This situation normally occurs during the first few generations, but the number of cases that do not converge decreases as the algorithm excludes the weakest solutions while building new generations. This can be clearly seen in Figure 7.5, which showed that the optimiser functioned well on the PowerPC machine because its code generator was not well tuned.

• One conclusion is that the proposed approach showed that GAs can assist compiler designers in tuning and enhancing code generators, and the results also show that ordering instruction sets can improve the performance of the code that compilers generate.

• The current implementation of the optimiser may not terminate if the execution of a training program does not terminate; therefore, it would be a good idea to add a time limit to terminate processes that take more time than expected, such as compiling or running a training program.

• The current version of the optimiser design allows users to run the optimiser several times (trials) on the same machine. The optimiser also documents most of the steps that it performs, such as generated genomes, corresponding solutions and the results of the fitness function, in separate files for each trial.

• Finding an optimised solution in a large search space is usually expected to take considerable time, especially if the optimiser runs for many generations or uses many training programs to evaluate the fitness of each solution. This problem can be handled by this optimiser: for example, if the optimisation process is interrupted for any reason, users can easily resume the process from any point (population), given that the preceding population has already been assessed. This feature basically copies all required information from the last population to the following population. After that, the optimiser can resume the selection, reproduction and evaluation processes on the last generation and continue with the consecutive generations.

8 Evaluating VP-Cell Compiler

In this chapter, we start by outlining the configuration of the machine used in these experiments, briefly describing the tests that were conducted to validate the developed VSM and the extended compiler, and then presenting the main measures that were taken to improve the performance of these two parallelising tools.

After that, we present the experimental results obtained using our developed parallelising tools, the VSM model and the VP-Cell compiler, and discuss the achieved performance and scalability. We ran two sets of experiments, micro-benchmarks and real-world applications, to examine whether our VP-Cell compiler improves the execution time of the automatically parallelised programs as compared to the execution time of the sequential ones. We also ran a number of C micro-benchmarks to examine whether the VSM model can be used as an independent tool to parallelise C code and whether it provides any performance improvement.

8.1 Machine Configuration

We used a Sony PlayStation 3 (PS3) console for the experiments. The details of its configuration are summarised in Table 8.1.

Console Model             PlayStation 3
Number of Cores           9 cores
Master Core (PPE)         Number: 1
                          Memory: 256 MB
                          Speed: 3.2 GHz
                          Cache: 256 KB L2, 64 KB L1
Accelerators (SPEs)       Number: 6 available
                          Memory: 256 KB
                          Speed: 3.2 GHz
Operating System          Linux Fedora 7
IBM Cell SDK              3.1.0
GNU Tool Chain            4.1.1

Table 8.1: Machine Configuration


The VSM interpreters are compiled using GNU ppu-g++ and spu-g++. The SPE code is based on the second version of the main SPE management library, libspe2, which provides a flexible threading model and works together with the pthread library.

8.2 Testing Developed Tools

During the course of this project, numerous tests were run on the original PowerPC compiler, the VSM model and the extended compiler. These tests include:

• Testing the PowerPC Compiler (Sequential Version)

– Occasionally writing simple C code to check the structure of the generated assembly code. This was needed given the lack of documentation on the GAS assembler.

– Writing small C programs to see how parameters are passed. This is a critical issue in function calling conventions, which permit separately compiled routines in the same or different languages to interact with each other.

– Writing Vector Pascal programs to inspect and trace the ILCG intermediate code and assembly outputs. This is sometimes needed to see how the generated intermediate code looks, especially when the compiler does not converge or there is a segmentation fault.

– Using the gdb debugger to analyse or trace generated code when problems were encountered.

– Testing the PowerPC back-end compiler using three different test suites:

* Standard routines that include more than 1300 lines of code written in Pascal. The compiler should pass this test before it goes on to the next tests.

* The Pascal Validation Suite version 5.7, which includes more than 190 programs and was introduced by the British Standards Institution in 1982.

* The Vector Pascal Acceptance test, which contains more than 50 additional testing programs.

– Changing the machine description if a given program failed to compile for the PowerPC but compiled successfully by previously developed VP compilers for other architectures.


– Testing the Instruction Ordering Optimiser of the PowerPC and Pentium architectures. A paper on the optimiser is due to be submitted in Feb. 2012.

Based on the last version of the compiler, out of the 190 tests in the Pascal Validation Suite, 83% passed the validation tests, 11% passed but output a FAIL, 2% passed but went into an infinite loop when executed, and 4% failed to compile. The compiler was also tested using the 50 VP Acceptance tests, and only 8% of these tests failed.

• Testing the VSM Model

– In order to identify the VSM performance problems and obstacles, we ran many tests that timed the activities of each instruction individually, so as to break down the time spent in each activity.

– Writing a simple C simulator to test the latency of different functions.

– Conducting many experiments to determine the appropriate VSM register size.

– Testing the two techniques for sending messages through mailboxes. We gained a considerable improvement by using MMIO registers instead of the IBM SDK library functions.

• Testing the Extended Compiler

– Testing the order of the extended machine instruction set.

– Using the DDT graphical debugging tool to debug applications on the Cell processor. DDT was developed by Allinea for multithreaded and parallel applications. It was a trial version, which was extended for free, yet it was very helpful.

8.3 Performance Tuning

It was necessary to consider optimising the two main components of our compiler system to improve the performance of the individual modules. This work involved:

• Tuning the PowerPC Compiler

The large bottleneck in the development of VP back-end compilers is instruction ordering. The order of machine instructions can affect the performance of the generated code, and as expected compilers have to go through this process many times, especially during the early stages of their development. Scheduling machine instructions manually is tedious work and represented the bottleneck in improving the performance of our compiler, because the search space is enormous. At the beginning of the compiler's development, the reordering of the PowerPC instructions was done manually, but in the later stages we relied on the Instruction Ordering optimiser which we developed as part of this project. The optimiser offers around 20% improvement, as shown in Chapter 7, on the tested applications as compared to our hand-tuned machine instruction ordering.

• Tuning the VSM Model

– Optimising SPE code: we were able to reduce the SPE program size by using the same code to load and store data of any data type.

– Virtual register size: we tested various register sizes (1024, 2048, 4096, 8192 and 16384 bytes) to find the optimal virtual register size.

– Messaging procedure: we managed to reduce the latency of sending two 32-bit messages by a factor of around 13x.

– Optimising the Store operation: the order of the DMA transfers involved in the Store operation was set in a way that allows overlap between data transfers.

– VSM vectorisation: using the SPE intrinsic functions provides a substantial improvement on computational operations.

8.4 Experimental Results

The results presented in this section were obtained using:

1. A set of micro-benchmarks that includes Basic Linear Algebra Subprograms (BLAS). The micro-benchmarks were implemented in Vector Pascal to evaluate the VP-Cell compiler system's performance and scalability.

2. The same set of micro-benchmarks, but coded in C. These experiments demonstrate and examine the possibility of using VSM as an API from other languages.

3. Two real world applications.

a) The first application looks at the computational efficiency of the Cell processor on the N-body problem. This problem is a scientific simulation that involves computing the motion of a number of planets (bodies) under physical forces such as gravity.


PROGRAM GeneticTest;
type vec = array[1..4096] of real;
var i: integer; s: real; x, y, z: vec;
begin
   x := 1.23; y := 2.34; z := 0.0;
   z := x + y;
end.

Figure 8.1: Simple Program in VP to Test a BLAS Kernel

b) In the next example, we look at an image filtering application. Convolution filters are important tools for processing images, for example to smooth or sharpen an image.

In these experiments, we assume that the arrays in the source programs are not subject to any memory alignment restrictions and that their lengths are an integer multiple of the VSM register size. All the benchmarks run on single-precision floating-point arrays using VSM registers of size 4096 ∗ P bytes, where P is the number of SPEs, unless otherwise stated.

8.4.1 VP Micro-Benchmarks

The following experiments were the first Vector Pascal benchmarks to be compiled by the VP-Cell compiler system. The selected micro-benchmarks include classical vector operations, such as a reduction operation and the dot product of two vectors, and typical examples of BLAS-2, such as matrix-vector product and rank-1 update operations.

Figure 8.1 shows the VP program that was used to evaluate the BLAS1 and BLAS2 kernels. It is a very simple program that includes variable declarations and initialisations and an array expression which adds vectors x and y and saves the result in vector z. It is very important to note here that the same code was used for generating sequential code and parallel code without any changes, annotations or directives.

8.4.1.1 Selected Kernels

Table 8.2 lists the examined BLAS kernels and the operations involved in each kernel. All the variables in this table are single-precision floating-point variables, in which s is a scalar, x, y and z are vectors of size n and A is a square matrix of size n×n. The last column in the table indicates whether the kernel involves a blocking operation or not.


                                                  Operations Involved
Kernel Description          Expression         load  store  Mul.  Add  Sqrt   Mode
Replicate a scalar          x := s*              √     √                       NB
Square root of a vector     x := sqrt(y)         √     √                  √    NB
Vector reduction (+)        s := redp(x)*        √     √            √          B
Elem-to-Elem product        x := y ∗ z           √     √     √                 NB
Dot product of 2 vectors    s := y.z*            √     √     √      √          B
Matrix-Vector Product       y := A ∗ x           √     √     √      √          B
Rank-1 update               A := A + x ∗ yT      √     √     √      √          B

* Scalar operation, B: Blocking Mode, NB: Non-blocking Mode

Table 8.2: Basic Linear Algebra Kernels

This table shall help to explain and discuss the performance results attained by the VP-Cell compiler, but before discussing these results we would like to explain how the compiler offloads or evaluates an array expression or a kernel on the SPEs.

8.4.1.2 Parallelising Array Expressions

During the compilation process, our compiler splits every expression in the source code that involves arrays of sizes equal to or bigger than the VSM registers into a sequence of VSM instructions (or operations). Every VSM instruction is then transformed into three assembly instructions: the first two pass the required information to the PPE, while the third calls the proper PPE routine. Upon executing the code, if one SPE is used, the called PPE function sends all the data to that SPE; otherwise it chops the data into blocks by calculating the starting address of each block and then sends a message to each SPE. The message(s) must determine the required operation, the virtual register(s) to be used and, in the case of memory access, the starting address. For example, to evaluate the expression x := y + z on the SPEs, the PPE splits the expression into four operations: load vector y, load vector z, add vectors y and z, and finally store the sum in x. The code segment given in Figure 8.2 shows an example of the VP-Cell compiler-generated machine code that corresponds to the array expression x := y + z.

8.4.1.3 Results

In the micro-benchmarks, we used floating-point arrays of size 4096 and virtual SIMD registers of size 4KB, and ran each micro-benchmark 10^5 times to get fair measures.

Figure 8.3 shows the performance of the PPE versus one SPE on the BLAS-1 and BLAS-2 micro-benchmarks, which are shown in Table 8.2. The achieved speedups on the different kernels using one SPE compared to the PPE range from 4 times to 9.3 times. It is worth, however, mentioning here that the current version of the VP compiler for the PPE does not generate vector instructions.

// Machine registers r22, r23 and r24 hold the starting
// addresses of vectors x, y and z respectively.

// load vector y into virtual register 0
li 3, 0
la 4, 0(r23)
bl LoadVec

// load vector z into virtual register 1
li 3, 1
la 4, 0(r24)
bl LoadVec

// Add virtual registers 0 & 1 and keep result in 0
li 3, 0
li 4, 1
bl AddVec

// Store the returned value into x (address r22)
li 3, 0
la 4, 0(r22)
bl StoreVec

Figure 8.2: Generated Code for Expression x := y + z

Figure 8.3 reveals that significant speedups were achieved using a single SPE: the SPE was more than 9 times faster than the PPE on the Replicate kernel and 8 times faster on the Element-to-element multiplication kernel. The SPE attained its highest speedups on these two kernels because, as shown in the last column of Table 8.2, they do not involve a call to a blocking operation. The highest speedup, on the Replicate kernel, is mainly due to the size of the data to be transferred in each kernel: on the Replicate kernel the SPE was required to copy only a scalar from the main memory, while on the Element-to-element multiplication kernel the SPE was required to copy two vectors.

Figure 8.3: Performance PPE vs. one SPE

This figure also shows that the SPE performed adequately well on the Square Root, Reduction and Rank-1 Update kernels, with an average speedup of around 6x, yet it was not as good as its performance on the first two kernels. The cause of this performance degradation differed from one kernel to another. Starting with the Square Root operation, the SPE speed was lower because this operation is considered a complex SPE operation [165] which takes more time in the computation than arithmetic operations. The SPE was also slow on the Reduction and Rank-1 Update kernels, but the reason is not associated with operational complexity as in the Square Root operation. The Reduction and Rank-1 Update kernels involve a blocking operation, namely the reduction operation. The reduction operation is relatively costly compared to the other operations because the PPE has to wait until all the SPEs complete the operation and then collect and sum the results returned by the SPEs. It is also important to report that even though the Rank-1 Update kernel involves more operations than the Reduction kernel, the SPE speedup on both kernels was very close. This is due to the fact that the PPE performed poorly on the Rank-1 Update kernel, which shows that the SPE performance was relatively fast. The SPE performance was, however, the poorest on the Dot product and Matrix-vector kernels because both involve a reduction operation and require more data movements than the other kernels.

To explore how the VP-Cell compiler parallelises array expressions automatically on multiple SPEs, we ran the same kernels in parallel on 2 and 4 SPEs. We only used 2 and 4 due to a divisibility constraint on the length of the virtual register: in these experiments the virtual register size was 4KB, which is not integrally divisible by either 3 or 6. Also, we could not use 8 SPEs because only 6 SPEs are accessible on the PS3.

Figure 8.4 (a) shows that a significant improvement in performance can be achieved by using multiple SPEs. This figure plots the speedups obtained from using 1, 2 and 4 SPEs compared with the PPE performance on each kernel. The figure shows that the performance with 2 SPEs was about 16 times faster than the PPE on the Replicate kernel and about 29 times faster with 4 SPEs, and almost the same speedup was achieved on the element-to-element multiplication. Though the speedup on the other kernels ranges between 12x and 21x, the average speedup achieved was about 11.5x on 2 SPEs and around 20x on 4 SPEs.

Figure 8.4 (b) presents the same results shown in Figure 8.4 (a) but from a different perspective; it shows the scalability attained by using multiple SPEs. As we can see, the SPEs achieved near-linear scalability on all the kernels, with an overall speedup of a factor of P^0.84, where P is the number of processors. The maximum speedup, gained on the Replicate kernel, was by a factor of P^0.94, and the minimum was about P^0.75, on the dot product of two vectors.

Figure 8.4: Performance and Scalability of SPEs
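To put these exponents in concrete terms, with P = 4 SPEs a speedup of P^0.84 corresponds to roughly 4^0.84 ≈ 3.2x, while the best case of P^0.94 gives about 3.7x and the worst case of P^0.75 about 2.8x.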

8.4.1.4 Conclusion

The results on the VP micro-benchmarks shown in Figures 8.3 and 8.4 are noteworthy, considering that the SPE has a small memory and requires data to be transferred using DMAs. Moreover, the array expressions in these micro-benchmarks involve a limited number of arithmetic floating-point operations, and each expression has to store the result back to the PPE. Thus the store operation represents between 20% and 33% of the operations involved in each kernel. Take the "sqrt" kernel as an example: it requires a load, a computational operation (namely square root) and a store, so the store operation accounts for 1/3 of the involved operations and therefore dominates the whole process, as it is relatively more costly than the load and the computation. However, we expect the compiler to attain better performance on array expressions in which the percentage of blocking operations relative to the total number of operations is small; that is, on an expression with several operands and a single store.

8.4.2 C Micro-benchmarks

The following C micro-benchmarks are exactly the same as the VP versions given in the previous subsection. The C version includes classical vector operations, such as a reduction operation and the dot product of two vectors, and typical examples of BLAS-2, such as a matrix-vector product operation. I ran these selected C kernels on the PPE as well as on the SPEs, using VSM as an API, to investigate whether the parallelised C code results in any performance improvement and then to evaluate the performance scalability of the SPEs. All the kernels ran using single-precision floating-point variables in which S is a scalar, v1, v2 and v3 are vectors of size 4096 and M is a matrix of size 1024 x 4096; the SPE virtual register size was 4KB, and most of the micro-benchmarks were run 10^5 times to get fair measures.

// Sequential Matrix-Vector Product function (runs on the PPE).
// The loop bounds (rows, cols) and the declarations of M, v1 and v3 are
// assumed here, following the description in Section 8.4.2.1.
void seqMatrixVecProduct() {
    for (int r = 0; r < rows; r++) {
        float sum = 0.0f;
        for (int c = 0; c < cols; c++)
            sum += M[r][c] * v1[c];      // dot product of row r of M with v1
        v3[r] = sum;                     // store the result in v3
    }
}

// Parallel Matrix-Vector Product function: iterates over the rows of M and,
// for each row, invokes the PPE routine "dotpf", which dispatches the dot
// product of that row and v1 to the SPEs (exact signature of dotpf assumed).
void parMatrixVecProduct() {
    for (int r = 0; r < rows; r++)
        v3[r] = dotpf(M[r], v1);
}

Figure 8.5: C code to compute the dot product of two vectors sequentially on the PPE and in parallel using the Cell's SPEs.

8.4.2.1 Parallelising Array Operations

Figure 8.5 illustrates how a C BLAS kernel can be parallelised using the current VSM implementation. This figure shows the implementations of the C functions which were used to evaluate a BLAS2 kernel, in particular the product of a matrix and a vector, sequentially and in parallel. The first function in Figure 8.5 is a simple sequential C function which uses two nested for loops to iterate over the rows and columns in order to compute the dot product of each row in matrix M and vector v1 and store the results in vector v3. The second function, called parMatrixVecProduct(), uses the VSM model as an API to parallelise and evaluate operations on the SPEs. It iterates over the rows of matrix M, and in each iteration it invokes a PPE routine, called "dotpf". The PPE routine in turn dispatches the received request to the SPEs to compute the dot product of two vectors and returns a scalar. It is very important to note that the code does not include any annotations or directives, and in order to use the VSM model programmers are only required to call the appropriate PPE function.


We compiled these two functions using the same GNU compilers that were used to compile and build the VSM object files. The sequential C code was fully optimised using the -O3 mode and run on the Cell's master (PPE) processor. To parallelise the second function, the VSM object files should be linked to the C code using a single command line:

ppu32-g++ foo.cpp ppe.o spe.o -lspe2 -lm -ofoo.exe

8.4.2.2 Results

Figure 8.6 plots the performance achieved from running the C kernels on the PPE and the SPEs. The C kernels correspond to the VP micro-benchmarks shown in Table 8.2. The performance of one SPE on most of the kernels, as shown in Figure 8.6, was poorer than the PPE, but not on the Replicate and Mat-Vec product kernels: the one SPE was faster than the PPE on these two kernels by a factor of 1.5x. The improvement on the Replicate kernel was due to two reasons: first, this kernel does not require a great deal of data movement because it only involves sending a scalar instead of moving an entire vector to an SPE local memory; secondly, the SPE carries out the Replication operation using SIMD instructions via the SPE intrinsic functions, which offer a considerable speedup. The improvement on the Mat-Vec product kernel, however, is due first to the slightly slow performance of the PPE on this kernel relative to the other kernels, and secondly to the fact that the SPE is only required to return a scalar, which needs only one 128-byte DMA transfer instead of the at least three DMA transfers needed to store back an entire vector, as was explained in Section 5.6.4.1.

Consider now using 2 SPEs. When the C code was parallelised via the VSM model on 2 SPEs, the performance of the SPEs on all the kernels was in general better than the PPE performance. The speedup obtained using 2 SPEs on the dot product kernel was only around 0.05% higher than the PPE, but it was much better on the other kernels. For example, on the Replicate operation the 2-SPE speedup was about 2.4x, while on the Mat-Vec Product kernel it reached 2.7x. The average speedup on the other kernels was approximately 1.3x.

Moving to 4 SPEs, the SPE performance achieved using C code is much better than the PPE's. The 4 SPEs achieved speedups of more than 3.3x and 3.9x on the Replicate and Mat-Vec product kernels respectively. On the other kernels, the average speedup increased from 1.3x with 2 SPEs to 1.8x with 4 SPEs.


Figure 8.6: Performance of C code on the PPE and the SPEs by Using VSM as an API

Figure 8.7 plots the scalability achieved by running the selected C BLAS kernels as the number of SPEs increases. The SPEs, as shown in this figure, achieved near-linear performance on all the kernels. Notice that the two kernels on which one SPE performed poorly scaled linearly when 2 SPEs were used. The interpretation of the linear improvement on the Reduction kernel is that this kernel returns only a scalar value instead of an entire vector, and thus the computation dominates the whole process rather than memory access and communication overheads. Once the computation was carried out on 2 SPEs in parallel, the computation time was reduced; that is, an optimal tradeoff between communication and computation was achieved.

Figure 8.7 also shows that a full linear speedup was gained on the Reduction kernel using 2 SPEs. The speedup factor was almost 2 times faster than one SPE, and it was the maximum speedup achieved among the kernels. Yet the performance on the same kernel was not linearly scaled as we moved from 2 SPEs to 4 SPEs, and this was due to the increase in the communication overheads associated with sending messages and gathering results from 4 sources instead of 2 SPEs. The speedup factor attained running the Reduction kernel on 4 SPEs instead of 2 SPEs was only about 1.4 times. The same reasoning accounts for the improvement gained on the dot product kernel, basically because it also does not need to store back a whole vector.

However, an overall speedup factor of more than 1.7x was obtained when moving from one SPE to 2 SPEs, and a factor of approximately 1.4x as we moved from 2 SPEs to 4 SPEs.

Figure 8.7: SPE Scalability On BLAS Kernels Using C Code.

8.4.2.3 Conclusion

The current implementation of the two-vector dot product kernel can be improved further. The implementation shown in Figure 8.5 is naive because in every iteration it loads vector v1 again. It could be optimised by introducing a new, separate PPE routine which takes a matrix and a vector; the new routine could then handle this deficiency internally by loading the vector only once on the SPEs, placing the load-vector operation outside the loop.

In this experimental code, the C kernels were parallelised by simply calling PPE routines, as shown in Figure 8.5 and in Figure 8.2, to examine the performance of the parallelised code. Figure 8.7 shows that marked improvements can be achieved by parallelising C code on multiple SPEs using the VSM model, considering that VSM was not implemented to be used as an API. Note also that the C code was parallelised without adding any annotations or directives or handling thread creation, data partitioning and synchronisation.

The experimental results in this section and the previous subsection also demonstrate that a fully implicit parallelising tool, such as the VSM model, can be used as an intermediate layer for developing parallelising compilers, or as an API for parallelising array operations written in any programming language that can be compiled separately and linked to C code.

8.4.3 N-body Benchmark

The N-body problem is a scientific simulation that involves computing the motion of a number of planets (bodies) under physical forces such as gravity. The gravitational force between each pair of bodies is defined by their position, velocity and mass. N-body simulations usually require massive computing power, and hence a number of experiments have used this problem for evaluating machine performance. The N-body problem has also been used recently for comparing the performance of modern parallel technology, such as the new SIMD extension supported by the Sandy Bridge processor [143] and other parallel architectures such as GPUs [166]. This problem was also selected by the Scottish Informatics and Computer Science Alliance (SICSA) as a challenge problem in Phase II of the SICSA Multicore Challenge [167]. Figure 8.8 presents the main steps of this problem.


For N bodies
  Each time step
    For each body B in N
      Compute force on it from each other body
      From these derive partial accelerations
      Sum the partial accelerations
      Compute new velocity of B
    For each body B in N
      Compute new position

Figure 8.8: N-body Problem Pseudocode

8.4.3.1 N-Body Algorithms

The N-body simulation requires computing the force between each pair of bodies. In reality, the number N of bodies or particles is often very large, and thus a number of algorithms and methods have been developed to optimise the simulation. The two common algorithms for computing the total force on each body are the All-Pairs method and the Barnes-Hut treecode [166]. The total number of interactions to be computed using an ordinary approach, such as the All-Pairs algorithm, is O(N²), while the Barnes-Hut method is an O(N log N) algorithm [166].
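To give a rough sense of the gap between the two (simple arithmetic, ignoring constant factors): for the largest problem size used later, N = 16384, the all-pairs approach evaluates on the order of N² ≈ 2.7 × 10⁸ pairwise interactions per time step, whereas an O(N log N) method costs on the order of N log₂ N = 16384 × 14 ≈ 2.3 × 10⁵.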

This benchmark was drawn from the Great Computer Language Shootout and was originally contributed by Christoph Bauer [168]. In the Vector Pascal version I made explicit use of operations on whole arrays, but to express the problem in a parallel style using arrays, the full N² forces have to be computed. The VP version is similar to the All-Pairs approach because it also goes over each body B in N and computes the forces of the other N − 1 bodies on B. This additional computation associated with the parallel solution ensures that the calculations are independent and can be safely carried out on multiple processors in parallel. The core of the algorithm is a procedure, called advance, whose main loop uses a position matrix to compute a matrix of acceleration components for each body in N. These components are summed along the rows to give the velocity increment. Algorithm 8.1 was used to test the VP-Cell compiler on the Cell processor.

8.4.3.2 PPE vs. SPE Performance

In order to use the VSM model, which uses large vector registers, we had to scale the original code from 5 bodies up to 1024, 4096, 8192 and 16384 bodies, and create C and Vector Pascal versions so that the two implementations could be compared.


Algorithm 8.1 N-Body Parallel Solution Procedure

The types used are:
  vect = array [1..n] of realt;
  pvect = ↑vect;
  matr = array [1..3, 1..N] of realt;
  pmatr = ↑matr;

The variables used are:

  Let x ∈ pmatr; x, y, z coordinates of the system
  Let v ∈ pmatr; x, y, z components of velocity
  Let a ∈ pmatr; x, y, z components of acceleration
  Let di ∈ pmatr; x, y, z components of distance from body i
  Let sqdi ∈ pvect; vector of sums of square distances from i
  Let d ∈ pvect; vector of distances from i

The key procedure:

procedure advance (dt: real);
var
  Let i, j ∈ integer;
  Let t ∈ real;
  row: array [1..3] of real;
begin
  for i ← 1 to N do
    row ← x↑[ι0, i];
    // Compute the displacement vector between each planet and planet i.
    di↑ ← x↑ − rowᵀ;
    // Next compute the Euclidean distances.
    sqdi↑ ← di↑[1] × di↑[1] + di↑[2] × di↑[2] + di↑[3] × di↑[3];
    // Prevent divide by zero for the square distance from self.
    sqdi↑[i] ← 1;
    d↑ ← √sqdi↑;
    // Now compute the acceleration vectors.
    a↑ ← m↑ × di↑ / (sqdi↑ × d↑);
    for j ← 1 to 3 do
      v↑[j, i] ← v↑[j, i] + dt × ∑ a↑[j];
  // Finally update positions.
  x↑ ← x↑ + v↑ × dt;
end;

In the Vector Pascal version, we made explicit use of operations on whole arrays, as shown in Algorithm 8.1, and we ran the same VP code on the Cell processor both sequentially and in parallel. In the following discussion we use kilo units, such as 1K, 4K, 8K and 16K, to refer to the N-body problem size. Note that if the data size to be processed on an SPE is bigger than the SPE virtual register, the compiler will automatically unroll the operation and use multiple DMA transfers.

Figure 8.9 shows the speedups obtained using the PPE and a single SPE on different sizes of the problem. Comparing the performance of the two core types, we can see that the speedup gained from using one SPE compared to the PPE was around 3.6x when the problem sizes were 1K, 4K and 8K bodies, and 4.5x with 16K bodies. To analyse these results, let us start with the 1K problem. This problem needs only 4KB, and since we are using only one SPE, the SPE managed to use the optimal register size, which is 4KB; this is why the single SPE performed better on 1K than on 4K and 8K. On the 4K and 8K problems, however, the SPE needed multiple DMA transfers to process the whole data, which consequently introduces additional overheads. This explains the slight drop in the speedups seen in Figure 8.9 on the 4K and 8K problems. Moving to the 16K problem, although 16K bodies require double the DMA transfers used for the preceding size, there was an increase in the speedup relative to the three smaller sizes. This was, in fact, due to the poor performance of the PPE, as shown in Table 8.3, and caching could be the main factor accounting for the PPE shortfall.
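The 4KB figure is consistent with single-precision (4-byte) reals, which is what the quoted size implies (an assumption made here, not stated explicitly): a vector operand of N = 1024 elements occupies 1024 × 4 B = 4 KB, exactly one optimal virtual register. By the same arithmetic, the corresponding operands for the 4K, 8K and 16K problems occupy 16 KB, 32 KB and 64 KB, i.e. 4, 8 and 16 passes through a 4 KB register, which is the doubling of DMA transfers referred to above.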


Figure 8.9: Performance of One SPE vs. the PPE (N-body Problem)

8.4.3.3 SPE Scalability

Before we discuss the results from using multiple SPEs, it is worth mentioning that the maximum VSM register size that can be used for the N-body problem of 1024 bodies is 4KB, and therefore when multiple cores are used the data is partitioned equally across the SPEs' registers. On the other sizes of the problem, however, the SPE virtual register size is 4KB. Recall that if the data is bigger than an SPE virtual register, multiple DMA transfers will be used. We shall see how the number of DMA transfers and the register sizes may affect the performance of the SPEs.


Figure 8.10 presents the speedups of the SPEs compared to the PPE on the four selected sizes of the N-body problem. It shows that the SPEs' performance on the 8K problem was irregular, yet on the other sizes the performance improves along both dimensions: as the number of SPEs increases, and as the size of the problem gets bigger.

N-body Speedups
Problem Size   1 SPE/PPE   2 SPEs/PPE   4 SPEs/PPE
1024           3.6         5.9          7.9
4096           3.5         6.2          10.3
8192           3.6         6.1          9.9
16384          4.5         7.6          12.4

Figure 8.10: Speedups of Vector Pascal on the Cell’s processors (N-body Problem)

To give a clearer picture of how the SPEs' performance scales, we present the same measurements from Figure 8.10 graphically in Figure 8.11 (a) and (b). Figure 8.11 (a) presents the performance of the SPEs compared to the PPE on each size of the N-body problem. We start with the 1K problem, which requires arrays of 4KB. Due to the size of the data in this problem, the maximum VSM register size that one can use is 4KB, and therefore the data to be processed on each SPE gets smaller as more SPEs are used. For example, with 4 SPEs and 1K bodies, each SPE ends up processing only 256 bodies, or 1KB of data, and thus we should not expect the same performance improvement as the number of SPEs increases. This can be seen in Figure 8.10 by comparing the improvement gained on the 1K problem as we move from 1 SPE to 2 SPEs and then to 4 SPEs. The performance on the 1K problem improved by a factor of 1.6x as we moved from 1 SPE to 2 SPEs but by only 1.3x as we moved from 2 SPEs to 4 SPEs. This means that the computation on small data blocks does not pay for the communication and DMA overheads, and that is why multiple SPEs, as explained in the previous subsection, did not perform as well on 1K bodies as on the other three sizes of the problem.

Now consider the 4K and 8K problems, which were solved using arrays of size 16KB and 32KB respectively. Figure 8.11 (a) shows that the performance on the 4K and 8K sets is generally very close. The SPEs' performance on 4K is very slightly better than on 8K because only half the DMA transfers are used compared with 8K, and consequently half the DMA overheads. This can be seen clearly by comparing the performance of 2 SPEs and 4 SPEs on the 4K and 8K problems; see the results in the second and

third rows of Figure 8.10. These results indicate that the performance degradation rate, caused by doubling the number of DMA transfers each time, was less than 2% when 2 SPEs were used and only about 4% with 4 SPEs.

[Figure 8.11 consists of two panels: (a) the speedup of 1, 2 and 4 SPEs relative to the PPE for each N-body problem size; (b) the speedup of 1 SPE relative to the PPE, of 2 SPEs relative to 1 SPE, and of 4 SPEs relative to 2 SPEs, for each problem size.]

Figure 8.11: The SPEs Performance

However, the 4K problem, compared with 8K and 16K, should offer the best environment for scaling performance across multiple SPEs, for two reasons: first, the problem is large enough to allow up to 4 SPEs to use the optimal register size, which is 4KB, and second, the 4K problem requires fewer DMA transfers than 8K and 16K. Yet Figure 8.11 (a) shows that the SPEs' fastest performance was achieved on the 16K problem rather than the 4K problem, even though the SPEs used 4 times as many DMA transfers as with the 4K problem. The performance improvement reported on the 16K problem was mainly due to the poor performance of the PPE on 16K compared to its performance on the smaller sizes.

Figure 8.11 (b) shows how the SPEs behave as the size of the problem increases. It shows the speedups gained from using 1 SPE compared to the PPE, and also the performance of 2 SPEs and 4 SPEs relative to 1 SPE. On the first three sizes of the problem, the speedup obtained from using 1 SPE relative to the PPE decreases as the problem gets bigger, which is expected given the increase in the number of DMA transfers, but on the 16K problem the SPE behaved differently: its speedup over the PPE jumped from 3.6x to 4.5x, where it should have declined because 16K requires double the DMA transfers of 8K. This clearly shows the effect of the PPE's poor performance, as explained in the previous subsection. The chart in Figure 8.11 (b) also shows that the 2 SPEs and 4 SPEs performed as expected: their relative performance drops as the size of the problem gets bigger, reflecting the cost of the additional DMA transfers as we move from one size to the next. However, Figure 8.11 (b) reveals that the



Figure 8.12: SPE Scalability

2SPEs and 4SPEs achieved the best performance on the 4K problem as they were using the optimal register size.

Figure 8.12 shows that the scalability attained from using multiple SPEs grows in a near-linear fashion. It reveals that the peak speedup, obtained using 4 SPEs on the 4K N-body problem, was a factor of around 2.95x, and that an overall speedup of about P^0.80 was obtained on the 4K problem, where P is the number of processors. Conversely, the minimum speedup was achieved on the 1K problem, due to the small size of the data blocks; the average speedup factor on the 1K problem was about P^0.63.
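These exponents can be read as α = log_P of the measured speedup over one SPE. For example, on the 4K problem with P = 4 and a speedup of about 2.95x over one SPE, α = ln 2.95 / ln 4 ≈ 0.78, close to the P^0.80 average quoted above.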

8.4.3.4 Conclusion

These results demonstrate why a 4KB register on each SPE is the best size for achieving good performance. This conclusion, however, is based on the current VSM implementation; the optimal size could be reduced if the VSM model were optimised further. We can also conclude that the sizes of the VSM register and the SPE virtual register have more impact on the performance improvement than the increase in the number of DMA transfers.

8.4.3.5 VP vs. C Performance

The following comparison is presented to show the ability of two different high-level programming languages to exploit parallelism automatically in sequential source code and to make use of the performance potential of the Cell processor. We compare the performance of the Cell on the N-body problem using Vector Pascal and C code. The code in both languages was written in a sequential style. The C code was compiled using the

same GNU compilers that were used to compile and build the VSM model. Since we only use sequential code, the C code runs only on the PPE, under the optimisation modes -O0 and -O3, while the same VP code runs on the Cell master processor and on the SPEs. To parallelise the VP code, users are only required to set the target processor, such as PPC or SPE, as a command-line option.
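For illustration, the invocation might look like the following. The driver name vpc and the exact spelling of the target names are assumptions made for this example; the text above states only that the target processor is chosen on the command line.

  vpc -cpuPPC nbody.pas    (sequential code for the Cell master core)
  vpc -cpuSPE nbody.pas    (code parallelised across the SPEs)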

Note that the VP code, which is based on the All-Pairs algorithm for the N-body problem [166], requires N² operations, while the C code is based on the Barnes-Hut method, which is an O(N log N) algorithm [166]. The two programs were run for 20 iterations. The results shown in Table 8.3 give the time per iteration in seconds for the two languages on four different sizes of the problem. We used exactly the same VP results that were discussed in the previous subsection. Recall also that on the 1K problem (4KB of data) the SPE virtual register size is only 1KB, but on the other sizes each SPE uses the optimal register size, which is 4KB.

Table 8.3 shows that VP on 1 SPE performed almost the same as the unoptimised C code (-O0). Compared with unoptimised C, VP was actually about 20% slower on 1K bodies, but as the problem got bigger the VP performance improved and became very close to C; on the 16K problem, VP on 1 SPE actually performed better than the unoptimised C. If we compare VP on 1 SPE against C with full optimisation (-O3), however, VP could not compete with the optimised C code. Though VP remained slower across all the sizes, the deficiency factor dropped by more than 40%, going down from 2.3x to only about 1.3x as we move from 1K to 16K. On 2 SPEs, VP performed slightly better than the unoptimised C code on most sizes of the problem and almost the same as the optimised C version.
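These factors can be read off Table 8.3 directly: on the 1K problem, 0.105 s (VP, 1 SPE) against 0.045 s (C, -O3) gives a ratio of about 2.3, while on the 16K problem 22.278 s against 16.524 s gives about 1.35.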

Now let us consider the performance of VP and C on 4 SPEs. Table 8.3 shows that the VP-Cell compiler performed much better than the C compiler when 4 SPEs were used to solve N-body problems of size 4K or bigger. Figure 8.13 gives a graphical representation of the performance of VP versus C using 4 SPEs. On the 1K problem, VP was slightly slower than fully optimised C, and this is, as mentioned previously, because each SPE works on a small data block. On the other sizes of the problem VP performed much better than C, and the speedup obtained by running VP on 4 SPEs ranges between 1.6 and 2 times the C code.

8.4.4 Image Filtering

Convolution is an image filtering operation that can be applied to smooth or sharpen an image based on the neighbouring pixels. The neighbouring pixels are given values as


N-body Performance (sec per iteration)
               Vector Pascal                             C
Problem Size   PPE       1 SPE    2 SPEs   4 SPEs        -O3      -O0
1024           0.381     0.105    0.065    0.048         0.045    0.085
4096           4.852     1.387    0.782    0.471         0.771    1.328
8192           20.355    5.715    3.334    2.056         3.232    5.591
16384          100.250   22.278   13.248   8.086         16.524   25.86

Table 8.3: Performance of Vector Pascal vs. C (N-body problem). The code in both languages is written in a sequential style. The VP code uses N² operations to solve the problem while the C code uses an O(N log N) algorithm.


Figure 8.13: Vector Pascal and C Performance on the Cell processor, using 1KB SPE virtual registers for the 1K problem and 4KB SPE virtual registers for 4K, 8K and 16K.

weights for each pixel, and these values are usually presented as a matrix called the convolution kernel. This experiment demonstrates the implementation of a parallel convolution algorithm on a matrix of real numbers. The format of the program that follows is generated by the built-in literate programming tool of the compiler, which outputs listings in formatted LaTeX.

The following procedure, fblurtime, is the main body of the program. It simply specifies the size and type of the data that will be operated on, and then times the Vector Pascal convolution procedure, blurp. The arrow (↑) here denotes a pointer.

program fblurtime;
const
  size = 1024;
  runs = 30;
type
  pmat = ↑matrix;
var
  Let im ∈ pmat;
  Let t1, t2 ∈ double;
procedure pconvp (var p: matrix; c0, c1, c2: real); (see Section 8.4.4.1)
procedure blurp (var im: matrix); (see Section 8.4.4.2)
procedure cconv (var p: real; rows, cols: integer; c1, c2, c3: real); external;
begin
  new(im, size, size);
  t1 ← secs;
  for i ← 1 to runs do
    blurp(im↑);
  t2 ← secs;
  writeln('PASCAL ', t2 - t1);
end.


8.4.4.1 pconvp

Convolution of an image by a matrix of real numbers can be used to smooth or sharpen an image, depending on the matrix used. If A is an input image and K a convolution matrix, then the convolved image B is given by

B_{y,x} = \sum_{i} \sum_{j} A_{y+i,\,x+j} \, K_{i,j}

A separable convolution kernel is a vector of real numbers that can be applied independently to the rows and columns of an image to provide filtering. It is a specialisation of the more general convolution matrix, but is algorithmically more efficient to implement. If k is a convolution vector, then the corresponding matrix K is such that K_{i,j} = k_i k_j.
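One way to see the efficiency gain: applying an n × n kernel K directly costs n² multiply-accumulate operations per pixel, while applying the vector k first along the rows and then along the columns costs only 2n; for the three-element kernel used below this is 9 operations against 6.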

Given a starting image A as a two-dimensional array of real values, and a three-element kernel c1, c2, c3, the algorithm first forms a temporary array T whose elements are the weighted sum of adjacent rows, T_{y,x} = c_1 A_{y-1,x} + c_2 A_{y,x} + c_3 A_{y+1,x}. Then in a second phase it sets the original image to be the weighted sum of the columns of the temporary array:

A_{y,x} = c_1 T_{y,x-1} + c_2 T_{y,x} + c_3 T_{y,x+1}.

Clearly the outer edges of the image are a special case, since the convolution is defined over the neighbours of the pixel, and the pixels along the boundaries are missing one neighbour. A number of solutions are available for this, but for simplicity we will perform only vertical convolutions on the left and right edges and horizontal convolutions on the top and bottom lines of the image.


procedure pconvp (var p: matrix; c0, c1, c2: real);
var
  Let T, l ∈ pmat;
  Let i ∈ integer;
  Let r, c ∈ integer;
begin
// This sequence performs a vertical convolution of the rows of the plane p and places the result in the temporary plane T.

r← p.rows; c← p.cols;

// Allocate a temporary array initialised in the middle to the original one.
new(T, r+2, c+2);

T↑[2..r+1, 2..c+1] ← p[1..r, 1..c];

// Replicate the rows out to fill the top and bottom margins of array T.

T↑[1] ← T↑[2]; T↑[r+2] ← T↑[r+1];

Now perform a vertical convolution of the plane T and place the result in p. Note that this is done by multiplying the whole temporary array by the kernel constants and then adding shifted versions of it.

p[1..r, 1..c] ← T↑[2..r+1, 2..c+1] × c1 +
                T↑[1..r, 2..c+1] × c0 + T↑[3..r+2, 2..c+1] × c2;

T↑[1] ← T↑[2]; T↑[r+2] ← T↑[r+1];
Again place it into the temporary array, and this time replicate horizontally:

T↑[2..r+1, 2..c+1] ← p[1..r, 1..c]; T↑[][1] ← T↑[][2]; T↑[][c+2] ← T↑[][c+1];


Next perform a horizontal convolution and free the temporary buffer.

p[1..r, 1..c] ← T↑[2..r+1, 2..c+1] × c1 +
                T↑[2..r+1, 1..c] × c0 + T↑[2..r+1, 3..c+2] × c2;
dispose(T);
end;

8.4.4.2 blurp

This procedure just specifies that the parallel blurring uses the kernel [0.25,0.5,0.25].

procedure blurp (var im: matrix);
begin
  pconvp(im, 0.25, 0.5, 0.25);
end;

8.4.4.3 Results

To investigate the effectiveness of the VSM register sizes, the following discussion first analyses the performance of the compiler on different image sizes using a 4KB VSM register on one or multiple SPEs, and then explores the compiler's performance using optimal register sizes. The VSM register size was set to 4KB to show the effect of operating on small SPE virtual registers, and because this size is compatible with all the image sizes used. Note that with the VSM register set to 4KB, the size of the SPE virtual registers depends on the number of SPEs used: the SPE virtual register size will be 4KB when one SPE is used, 2KB with 2 SPEs and 1KB with 4 SPEs. We ran this experiment using three image sizes, 1024x1024, 2048x2048 and 4096x4096 pixels in floating-point format, and the results presented in the following table and figures show that significant performance improvements are obtained by using the Cell's accelerators with the different VSM register sizes.

The performance of the Cell's cores on blurring different image sizes using only a 4KB VSM register is shown in Table 8.4. All measurements were taken using the Linux "time" command, and the unit of measurement is seconds.


Performance (sec) - 4KB VSM Register
Image Size     PPE    1 SPE   2 SPEs   4 SPEs
1024 x 1024    0.20   0.11    0.08     0.09
2048 x 2048    1.95   0.41    0.27     0.32
4096 x 4096    8.86   1.56    1.04     1.24

Table 8.4: Performance of the PPE and SPEs on Blurring Using a 4KB VSM Register


Figure 8.14: Performance of SPEs versus PPE on Blurring Program

Figure 8.14 compares the speedup obtained using the Cell's SPEs against the PPE performance as the number of SPEs increases. Consider first the performance of a single SPE compared to the PPE. As we can see in this figure, and also in Table 8.4, the PPE performs increasingly poorly as the image gets bigger; we saw similar behaviour in the previous example when the size of the N-body problem was 16K. The performance of one SPE using a 4KB register, which is the optimal size, was about 2 times faster than the PPE. However, as the image gets bigger, the SPE's speedup relative to the PPE increases more rapidly: it jumped from about 2x on the 1024 image to around 5x and 8.5x on the 2048 and 4096 images respectively. This large jump in speedup, however, was due to the slow performance of the PPE on big images. This can be deduced from Table 8.4: the PPE was 10 times slower on the 2048 image than on the 1024 image, while the SPE was only around 4 times slower across the same sizes. The figure also shows that, interestingly, a single SPE and 2 SPEs performed better than 4 SPEs, and we saw a similar situation with the N-body benchmark when the performance on the 1K problem degraded as we moved from 2 SPEs to 4 SPEs.

Scalability was investigated under different configurations.


Figure 8.15: SPE Scalability on Blurring a 4096x4096 Image Using Different VSM Register Sizes

Figure 8.15 shows the performance of blurring an image of 4096x4096 pixels as it is scaled across the Cell's SPEs using 4KB, 8KB and 16KB VSM registers. This figure, which shows the result of splitting the image across 1, 2 and 4 SPEs, reveals that when the SPEs use an optimal register size the application scales better. For example, with 4 SPEs and a 4KB VSM register (i.e. a 1KB register per SPE), the performance of the 4 SPEs relative to 1 SPE dropped from 1.5x to 1.3x. The performance of 2 SPEs was best when an 8KB VSM register (a 4KB register per SPE) was used, while the best performance of 4 SPEs was reached with a 16KB VSM register (a 4KB register per SPE).

8.4.5 Conclusion

We have presented here the results obtained from running two sets of benchmarks, comprising 12 BLAS1 and BLAS2 kernels and two real-world examples, on the Cell processor to compare the execution time of sequential and parallel code written in the Vector Pascal and C programming languages. The VP-Cell parallelising compiler targets only array expressions, and thus the Vector Pascal code must be an array-based implementation. This is why we chose only two real applications: few real-world benchmarks are coded in Vector Pascal, and many of the Pascal benchmarks are coded in a standard sequential fashion rather than as array-based code.

These experimental results show that significant performance improvements can be obtained by using the Cell's accelerators, especially if the SPEs use the optimal register size. The results also show that the VSM model, in its current implementation, is a very useful tool for parallelising data-intensive applications.

9 Conclusion and Future Work

A number of modern computer systems offer high-performance platforms at reasonable cost, which has made them widely used in many application areas such as image processing, graphics, multimedia, modelling and scientific computation. However, this widespread industry adoption of multi-core architectures, homogeneous and heterogeneous, has a significant influence on mainstream software and application development, and has introduced new challenges for software developers to provide appropriate, easy-to-use and up-to-date tools for developing parallel and concurrent programs. The emergence of the new heterogeneous platforms makes it even harder for compilers to generate efficient parallel code than on homogeneous machines. Heterogeneous multi-core architectures, such as the Cell architecture, have different types of processing cores, and each type is designed to support and carry out a different set of functions.

Our research project aimed at developing a new approach for automatic parallelisation of computations on large data sets that is specifically targeted at modern heterogeneous hardware. The developed tool focuses only on array-based code and should be capable of parallelising vector/array operations to run on general-purpose heterogeneous multicore platforms. The ambitions were to reduce the complexity associated with the fully automatic parallelisation approach by focusing only on array expressions, and also to ease the task of developing parallel applications by letting programmers concentrate on algorithms rather than on parallelisation issues such as communication, partitioning, alignment and synchronisation.

The work was demonstrated by designing and implementing a Virtual SIMD Machine (VSM) model that completely hides the intricate details of the Cell heterogeneous architecture. This model is based on new parallelisation techniques that imitate a SIMD instruction set using virtual (large) registers and Virtual SIMD Instructions (VSIs). VSM supports parallelising array operations across the heterogeneous cores of the Cell processor. It consists of two co-operating virtual interpreters, one for each of the two core types, PPE and SPE, on the Cell processor. The PPE interpreter runs on the Cell's master core, manages the overall system, and offers a set of functions for parallelising array operations on the SPEs as well as other operations such as thread creation and message handling.


Throughout this thesis we also investigated whether an array programming compiler, such as the Glasgow Vector Pascal (VP) compiler, can be extended to use this model to exploit data parallelism implicitly and attain sufficient performance. The work involved extending the machine description of the PowerPC back-end of the compiler so that it can collaborate with our VSM model. The compiler extension included defining a virtual SIMD register set and introducing a new instruction set that operates on these virtual registers. We also had to modify some machine-dependent routines, such as ENTER and LEAVE, to create and terminate threads. This arrangement simplifies the task of compiler code generators in transforming sequential code into parallel code.

This dissertation has shown that imitated SIMD techniques can be used to develop a fully implicit parallel programming model which reduces the burden of developing parallelising compilers that exploit a heterogeneous multi-core architecture. This research project presented a tool that does not require learning a new language, providing hints to make efficient use of processing units (as in CellVM), or adding annotations or directives (as in OpenMP, Offload C++ and OpenCL). The overall results in this thesis demonstrate that this approach significantly reduced the execution time of the automatically parallelised programs compared to the execution time of the sequential ones, without the need for any annotations or process directives. Our approach also showed considerable performance improvements with respect to scalability. However, the VSM performance is below the theoretical peak performance, mostly due to communication overhead and alignment constraints, as the SPEs are designed to operate on 128-byte aligned data. The VSM design also imposes divisibility restrictions on the size of the SPE virtual registers and the number of SPEs that can be used, and this results in not being able to use all the resources (SPEs) of the Cell processor.

9.1 Contribution

• Demonstrating the possibility of abstracting and completely hiding the intricate details of heterogeneous architectures.

• Presenting techniques that provide a fully implicit programming model for parallelising array operations and support scalable parallelisation.

• Introducing a framework that can serve as an abstract model to shorten the time needed to develop parallelising compilers.

• Developing parallelising tools that help compilers automatically generate parallel code, with the aim of reducing the execution time of the parallelised code.


• Developing a new approach to optimising the code generator of a compiler. This approach is based on genetic algorithm techniques to automatically optimise the ordering of machine instructions.

9.2 Future Work

There are a number of interesting future directions that may enhance the VSM model. Some of these opportunities for future work are listed below.

• While the current implementation of VSM targets the Cell BE processor, the concept is more general and can be applied to other heterogeneous multicore systems that combine a host with accelerators (APUs), such as AMD Fusion, Intel Ivy Bridge, and GPUs.

• The communication overhead is still high. Optimising the communication would therefore be one way to improve the VSM performance and to reduce the virtual register sizes, i.e. the size of arrays for which parallelisation becomes worthwhile.

• The VSM instruction set does not support dual-mode operations such as multiply-accumulate, and supporting this feature would evidently improve many computations that involve accumulation, such as vector dot products and matrix multiplication.

• The current VSM implementation includes an instruction for every computational operation for every data type. Each computational instruction uses two physical machine registers and two assembly instructions to pass the virtual register numbers to the analogous PPE function, and the corresponding PPE function then supplies the operation code (opcode). It would be more efficient if a VSM instruction used only one register to pass the virtual register numbers as well as the opcode. This would first reduce the number of assembly instructions by a third and, most importantly, allow a single PPE (template) function to be implemented for all data types; see the sketch after this list.

• The code size of the current SPE interpreter (or program) is about 25KB, which is reasonably good since it requires only about 10% of the SPE local store. However, template techniques would also be useful to reduce the SPE interpreter code size, especially if more operations are added.
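The single-register encoding suggested above could, for example, pack the opcode and the two virtual register numbers into one 32-bit word. The field widths and helper functions below are a hypothetical sketch for illustration only; they are not part of the current VSM implementation.

#include <stdint.h>

/* Hypothetical packing: an 8-bit opcode and two 8-bit virtual register
 * numbers carried in one 32-bit word, so a single physical register and a
 * single assembly instruction suffice to pass them to the PPE. */
static inline uint32_t vsm_pack(uint8_t opcode, uint8_t rdst, uint8_t rsrc)
{
    return ((uint32_t)opcode << 16) | ((uint32_t)rdst << 8) | (uint32_t)rsrc;
}

static inline void vsm_unpack(uint32_t word,
                              uint8_t *opcode, uint8_t *rdst, uint8_t *rsrc)
{
    *opcode = (word >> 16) & 0xFFu;
    *rdst   = (word >> 8)  & 0xFFu;
    *rsrc   =  word        & 0xFFu;
}

A single PPE template function could then unpack the word and switch on the opcode, instead of requiring one PPE function per operation and data type.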

Bibliography

[1] S. Gill, “Parallel Programming,” The Computer Journal, vol. 1, no. 1, pp. 2–10, 1958.

[2] J. Glasgow and M. JenkinsCarl, “Expressing parallel algorithms in Nial,” Parallel Computing, vol. 11, no. 3, pp. 331–347, 1989.

[3] C. Geoffrey, R. Williams, and P. Messina, Parallel Computing Works. San Fran- cisco, CA: Morgan Kaufmann Inc., 1994.

[4] G. V. Wilson, "The History of the Development of Parallel Computing," 1994.

[5] A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel virtual machine: a users’ guide and tutorial for networked parallel computing. MIT Press Cambridge, MA, USA, 1995.

[6] K. Pingali, “Parallel programming languages,” tech. rep., Cornell University, 1998.

[7] P. Cockshott and K. Renfrew, SIMD programming for Windows and Linux. Springer, 2004.

[8] A. Scheinine, “Introduction to Parallel Programming Concepts,” tech. rep., Louisi- ana State University, 2009.

[9] K. Asanovic et al., "A view of the parallel computing landscape," Commun. ACM, 52(10), pp. 56–67, 2009.

[10] P. Mckenney, M. Gupta, M. Michael, P. Howard, J. Triplett, and Walpole, “Is Par- allel Programming Hard, and if so, why?,” tech. rep., Portland State University, 2009.

[11] G. Tournavitis, Z. Wang, B. Franke, and B. M., “Towards a Holistic Approach to Auto-Parallelization,” in ACM, PLDI09, October 2009.

[12] R. Leupers, "Code selection for media processors with SIMD instructions," in DATE '00: Proceedings of the conference on Design, automation and test in Europe, (New York, NY), pp. 4–8, ACM Press, 2000.


[13] D. Luebke and G. Humphreys, "How GPUs work," IEEE Computer, pp. 126–130, 2007.

[14] M. Hassaballah, S. Omran, and Y. Mahdy, "A review of SIMD multimedia extensions and their usage in scientific and engineering applications," Comput. J. 51, pp. 630–649, 2008.

[15] C. Gregg and K. Hazelwood, “Where is the data? why you cannot debate gpu vs. cpu performance without the answer,” International Symposium on Performance Analysis of Systems and Software (ISPASS)), 2011.

[16] A. Krall and S. Lelait, “Compilation techniques for multimedia processors,” Inter- national Journal of Parallel Programming, vol. 28, no. 4, pp. 347–361, 2000.

[17] N. Sreraman and R. Govindarajan, "A vectorizing compiler for multimedia extensions," vol. 28, pp. 363–400, 2000.

[18] A. Richards, “The Codeplay Sieve C++ Parallel Programming System,” 2006.

[19] A. Ewing, H. Richardson, A. Simpson, and R. Kulkarni, Writing Data Parallel Programs with High Performance Fortran. Edinburgh Parallel Computing Centre, 1998.

[20] “Parallel programming essentials via the intel tbb,” http://www.codeproject.com/KB/Parallel Programming/Threading Blocks.aspx.

[21] T. Rauber and G. Rünger, Parallel Programming: For Multicore and Cluster Systems. New York: Springer-Verlag, 2010.

[22] F. Alastair, P. Keir, and L. Anton, "Compile-time and run-time issues in an auto-parallelisation system for the Cell BE processor," in Proceedings of the 2nd EuroPar Workshop on Highly Parallel Processing on a Chip (HPPC'08), vol. 5415 of Lecture Notes in Computer Science, pp. 163–173, Springer, 2008.

[23] M. Hall et al., "Improving the effectiveness of parallelizing compilers," ACM, 1995.

[24] N. DiPasquale, V. Gehlot, and W. T., “Comparative Survey of Approaches to Auto- matic Parallelization,” MASPLAS’05, 2005.

[25] A. Ghuloum, “What makes Parallel Programming Hard,” 2007. http://blogs.intel.com/research/2007/08/what makes parallel programmin.php.

[26] A. Guillon, “An apl compiler: The sofremi-agl compiler,” 1987. ACM,89.


[27] H. Wei and J. Yu, “Mapping to cell: An effective compiler framework for heterogeneous multi-core chip,” the 3rd international workshop on OpenMP, 2008.

[28] G. Russell, P. Keir, A. Donaldson, U. Dolinsky, A. Richards, and C. Riley, "Programming heterogeneous multicore systems using threading building blocks," in Proceedings of the EuroPar Workshop on Highly Parallel Processing on a Chip (HPPC 2010), 2010.

[29] A. Donaldson, U. Dolinsky, A. Richards, and G. Russell, "Automatic offloading of C++ for the Cell BE processor: a case study using Offload," MuCoCoS'10, IEEE Computer Society, pp. 901–906, 2010.

[30] A. Marowka, "Performance of OpenMP on Multicore Processors," in ICA3PP: Proceedings of the 8th International Conference, 2008, pp. 208–219, Springer-Verlag, Berlin Heidelberg, 2008.

[31] D. Walker and J. Dongarra, "MPI: A Standard Message Passing Interface," Supercomputer, vol. 12, no. 1, pp. 56–68, 1996.

[32] P. Jaaskelainen, C. Lama, P. Huerta, and J. Takala, "OpenCL-based design methodology for application-specific processors," Embedded Computer Systems (SAMOS), 2010 International Conference, pp. 223–230, 2010.

[33] B. Chamberlain, D. Callahan, and H. Zima, “Parallel programmability and the Chapel language.,” In Interntional Journal High Performance Comp., vol. 21, pp. 291–312, 2007.

[34] G. Clemens and S. Scholz, "SAC: Off-the-shelf support for data-parallelism on multicores," pp. 25–33, 2007.

[35] P. Cockshott and G. Michaelso, “Orthogonal Parallel Processing in Vector Pascal,” J. Comput. Lang. Syst. Struct., 32,, pp. 2–41, 2006.

[36] “Codeplay: Portable high-performance compilers,” 2011. http://www.codeplay.com/.

[37] “OpenHMPP, New HPC Open Standard for Many-Core,” http://www.openhmpp.org/en/OpenHMPPConsortium.aspx, Online; accessed 14/9/2011.

[38] A. Donaldson, C. Riley, A. Lokhmotov, and A. Cook, “Auto-parallelisation of sieve c++ programs,” pp. 18–27, 2007.

211 Bibliography

[39] R. Choy and A. Edelman, “Parallel MATLAB: Doing it Right,” In Interntional Journal High Performance Comp., 2003.

[40] L. Snyder, A Programmer’s Guide to ZPL. Cambridge: MIT Press, 1999.

[41] T. Turner, Vector Pascal a Language for the Array Processor. PhD thesis, Iowa State University, USA, 1987.

[42] P. Cockshott, “Vector pascal reference manual,” SIGPLAN, vol. 37, no. 6, pp. 59– 81, 2002.

[43] J. David, J. Hall, and P. Trinder, "A strategic profiler for Glasgow Parallel Haskell," in The Proceedings of the International Workshop on the Implementation of Functional Languages (IFL'98), (London), Sept. 1998.

[44] S. Grelck and S. Scholz, "SAC - A Functional Array Language for Efficient Multi-threaded Execution," International Journal of Parallel Programming, vol. 34, no. 4, 2006.

[45] G. Blelloch, J. Hardwick, j. Sipelstein, and M. Zagha, “NESL user’s manual (for NESL version 3.1).,” Technical Report CMU-CS-95-169, 1995.

[46] G. Blelloch, "NESL: A nested data-parallel language," vol. CMU-CS-95-170, Carnegie Mellon University, 1995.

[47] M. Feldman, “Sun’s Fortress Language: Parallelism by Default,” In Interntional Journal High Performance Comp., 2008.

[48] A. Noll, A. Gal, and F. M., “Cellvm: A homogeneous virtual machine runtime system for a heterogeneous single-chip multiprocessor,” Tech. Rep. 06-17, Tech. Rep., 2006.

[49] R. Mcilroy and J. Sventek, “Hera-JVM: Abstracting processor heterogeneity behind a virtual machine,” The 12th Workshop on Hot Topics in Operating Systems (HotOS 2009), 2009.

[50] M. Harvey and G. Fabritiis, “Swan: A tool for porting CUDA programs to OpenCL,” Computer Physics Communications, vol. 182,no. 4, pp. 1093–1099, 2011.

[51] “Cuda,” http://wiki.mediacoderhq.com/index.php/CUDA, 2009.

[52] “Introduction to programming,” (Cornell University), AMD, May, 2010.


[53] R. Farber, “OpenCL, a Portable Parallelism,” tech. rep., The code project, 2010.

[54] V. Srinivasa, A. Santhanam, and M. Srinivasan, "Cell Broadband Engine processor DMA engines," 2005. http://www.ibm.com/developerworks/power/library/pa-celldmas/.

[55] J. Kahle, M. Day, H. Hofstee, C. Johns, T. Maeurer, and D. Shippy, “Introduction to the Cell multiprocessor,” IBM journal of Research and Development, vol. 49, no. 4, pp. 589–604, 2005.

[56] A. Arevalo et al., Programming the Cell Broadband Engine Architecture. International Technical Support Organization, 2008.

[57] IBM, “Ibm cell training session notes,” (Baresbury Laboratory), April, 2009.

[58] W. P. Cockshott, "Vector Pascal an array language for multimedia code," in APL2002, ACM SIGAPL, July 2002.

[59] P. Shaw and G. Milne, "A highly parallel FPGA-based machine and its formal verification," LECTURE NOTES IN COMPUTER SCIENCE, pp. 162–162, 1993.

[60] P. Cooper et al., "Offload - automating code migration to heterogeneous multicore systems," HiPEAC, ser. LNCS, pp. 337–352, 2010.

[61] “Illiac iv,” http://en.wikipedia.org/wiki/ILLIAC IV, Online; accessed 12-August- 2010.

[62] P. Behrooz, Introduction to Parallel Processing Algorithms and Architectures. Kluwer Academic Publishers, 2002.

[63] "The Cray X-MP Series of Computer Systems," 1985. http://archive.computerhistory.org/resources/text/Cray/Cray.X-MP.1985.102646183.pdf.

[64] R. Russell, “The cray-1 computer system,” in Computer Structures (D. Sieworek, G. Bell, and A. Newell, eds.), McGraw Hill, 1985.

[65] D. Hillis, The connection machine. Cambridge, MA: MIT Press, 1986.

[66] "The Cray Y-MP Series of Computer Systems," 1985. http://en.wikipedia.org/wiki/Cray-Y-MP.

[67] P. Ceruzzi, A history of modern computing. The MIT press, 2003.


[68] M. Hill and M. Marty, “Amdahl’s law in the multicore era,” IEEE Computer 41,7, pp. 33–38, 2008.

[69] B. Schauer, “Multicore Processors - A Necessity,” 2008. http://www.csa.com/discoveryguides/multicore/review.pdf.

[70] A. Ross and J. Wood, “NVIDIA GeForce 256 DDR Guide ,” 1999. http://www.sharkyextreme.com/hardware/guides/nvidia-geforce256/6.shtml.

[71] "NVIDIA CUDA, programming guide, version 2.1," tech. rep., Aug. 2008.

[72] R. Lysecky and F. Vahid, “A study of the speedups and competitiveness of FPGA soft processor cores using dynamic hardware/software partitioning,” in Proceedings of the conference on Design, Automation and Test in Europe-Volume 1, pp. 18–23, IEEE Computer Society Washington, DC, USA, 2005.

[73] J. Gorblod, “AMD Phenom II X6 1090T Black Edition,” 2010. http://www.bit- tech.net/hardware/cpus/2010/04/27/amd-phenom-ii-x6-1090t-black-edition/1.

[74] J. Abellan, J. Fernandez, and M. Acacio, "CellStats: a Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE," Proceedings of 16th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pp. 261–268, 2008.

[75] P. Homburg, M. Steen, and A. Tanenbaum, "Distributed shared objects as a communication paradigm," In Proceedings of the 2nd Annual ASCI Conference, pp. 132–137, 1996.

[76] B. Bershad and M. Zekauskas, “Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors,” 1991.

[77] J. C. Rojas and M. Leeser, "Programming portable optimized multimedia applications," in MULTIMEDIA '03: Proceedings of the eleventh ACM international conference on Multimedia, (New York, NY), pp. 291–294, ACM Press, 2003.

[78] G. Grelck and S. Scholz, "SAC - from high-level programming with arrays to efficient parallel execution," Parallel Processing Letters, vol. 13, no. 3, pp. 401–412, 2003.

[79] B. K. et al., "Supporting OpenMP on Cell," International Journal of Parallel Programming, pp. 289–311, 2008.

[80] M. Flynn, "Some Computer Organizations and Their Effectiveness," IEEE Transactions on Computers C-21, pp. 948–960, 1972.


[81] C. Lomont, “Introduction to Intel Advanced Vector Extensions,” In Proceedings of the 2nd Annual ASCI Conference, pp. 132–137, June 2011.

[82] S. Reddaway, “Dap a distributed array processor,” SIGARCH Comput. Archit. News, vol. 2, no. 4, pp. 61–65, 1973.

[83] “P5 (microarchitecture),” 2011. http://en.wikipedia.org/wiki/P5-(micro architec- ture), Online; accessed 11/8/2011.

[84] AMD, “3dnow! technology manual,” 2000.

[85] K. Diefendorff, P. Dubey, R. Houchsprung, and H. Scales, “AltiVec extension to PowerPC accelerates media processing,” IEEE Micro, 20, pp. 85–95, 2000.

[86] Intel, “Picture the Future now Intel AVX,” http://software.intel.com/en-us/avx/, 2011.

[87] V. Konda and et al., “A simdizing c compiler for the mitsubishi electric neuro4 processor array,” in SUIF Workshop, 1996.

[88] “Data Structure Alignment,” http://en.wikipedia.org/wiki/Data-structure- alignment-RISC, 2011.

[89] L. Huang, L. Shen, Z. Wang, W. Shi, N. Xiao, and S. Ma, "SIF: Overcoming the limitations of SIMD devices via implicit permutation," in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pp. 1–12, 2010.

[90] A. Shahbahrami, B. Juurlink, and S. Vassiliadis, "Performance impact of misaligned accesses in SIMD extensions," 17th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC2006), pp. 334–342, 2006.

[91] M. Scarpino, Programming The CELL processor. Prentice Hall, 2008.

[92] B. Ramarkishna and J. Fisher, "Instruction-Level Parallel Processing: History, Overview and Perspective," tech. rep., Hewlett Packard, Computer Systems Laboratory, 1992.

[93] K. Asanovic et al., "A view of the Parallel Computing Landscape," Commun. ACM, 52(10), pp. 56–67, 2009.

[94] “Intel Threading Buidling Blocks,” http://en.wikipedia.org/wiki/Intel-Threading- Building-Blocks.


[95] “CUDA Data Parallel Primitives Library,” http://code.google.com/p/cudpp/, On- line; accessed 11/8/2011.

[96] M. Metcalf and J. Reid, Fortran 90 explained. Oxford University Press, Inc. New York, NY, USA, 1990.

[97] A. Formella et al., The SPARK 2.0 System - a Special Purpose Vector Processor with a VectorPASCAL Compiler. The Twenty-fifth Annual Hawaii International Conference on System Sciences, HICSS 25, 1992.

[98] P. W. Trinder, K. Hammond, J. S. Mattson Jr., A. S. Partridge, and S. L. Peyton Jones, "GUM: a portable implementation of Haskell," in Proceedings of Programming Language Design and Implementation, (Philadelphia, USA), May 1996.

[99] A. Giacalone, P. Mishra, and S. Prasad, "Facile: A symmetric integration of concurrent and functional programming," International Journal of Parallel Programming, vol. 18, pp. 121–160, April 1989.

[100] S. Scholz, “Sac- efficient support for high-level array operations in a functional setting,” Journal of Functional Programming, vol. 13, no. 6, pp. 1005–1059, 2003.

[101] S. Scholz, "Functional array programming in SAC," Department of Computer Science, University of Hertfordshire, 2005.

[102] H. Yu and L. Rauchwerger, "Adaptive Reduction Parallelization Technique," in ICS'00: Proceedings of the 14th International Conference on Supercomputing, pp. 66–75, ACM New York, NY, USA, 2000.

[103] K. Olukotun and L. Hammond, “A Future of Multiprocessors,” ACM Transactions on Embedded Computing Systems, , Publication date: April 2008, vol. 7, no. 3, pp. 26–29, 2005.

[104] K. Iverson, A programming language. New York: Wiley, 1966.

[105] IBM, “Language, time sharing system,” IBM. [Online; accessed 11/4/2011].

[106] “APL (programming language),” http://en.wikipedia.org/wiki/APL (programming language), Online; accessed 7/12/2010.

[107] K. Iverson, Programming in J. Iverson Software Inc, Toronto, 1992.

[108] T. Budd, "An APL compiler for a vector processor," ACM Transactions on Programming Languages and Systems, vol. 6, July 1984.


[109] B. Hakami, Efficient Implementation of APL in a Multilanguage Environment. International Computers Ltd, 1975.

[110] L. Bradford, “The design and implementation of a region-based parallel language,” in PhD Thesis, University of Washington, November 2001.

[111] “ZPL (programming language),” http://en.wikipedia.org/wiki/ZPL (programming language), Online; accessed 4/4/2011.

[112] S. Scholz, "Single Assignment C (tutorial)," Department of Computer Science, University of Hertfordshire, 2007.

[113] C. Peter, “Porting the Vector Pascal Compiler to the Playstation 2,” Master’s thesis, University of Glasgow Dept of Computing Science, 2005.

[114] J. Guo, J. Thiyagalingam, and S.-B. Scholz, “Breaking the gpu programming barrier with the auto-parallelising sac compiler,” in Proceedings of the sixth workshop on Declarative aspects of multicore programming, DAMP ’11, (New York, NY, USA), pp. 15–24, ACM, 2011.

[115] J. Backus, "The history of FORTRAN I, II, and III," IEEE Annals of the History of Computing, 20(4), pp. 68–78, 1990.

[116] “Fortran Language Standards,” http://fortranwiki.org/fortran/show/Standards, On- line; accessed 7/10/2011.

[117] M. Metcalf and J. Reid, The F Programming language. Oxford University Press Inc., New York, 1996.

[118] “Fortran 77 standard,” http://www.fortran.com/F77-std/rjcnf0001.html, Online; ac- cessed 13/9/2010.

[119] “Fortran,” http://en.wikipedia.org/wiki/Fortran-FORTRAN, Online; accessed 16/9/2010.

[120] R. Davies, A. Rea, and T. Dimitris, “Introduction to Fortran 90,” An introduction Course for Novice Programmers.

[121] “Co-array fortran,” http://www.co-array.org/, Online; accessed 06/5/2012.

[122] J. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes. Elsevier Inc., 2005.

[123] “Virtual Machine,” http://en.wikipedia.org/wiki/Virtual machine, Online; accessed 13/10/2011.


[124] “Clang Compiler User’s Manual,” 2009. http://clang.llvm.org/docs/UsersManual.html.

[125] A. Dijkstra, J. Fokker, and S. Swierstra, "The Architecture of the Utrecht Haskell Compiler," Haskell '09, 2nd ACM SIGPLAN Symposium on Haskell, ACM, pp. 93–104, 2009.

[126] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing. 2003.

[127] A. Ferrari, “Jpvm: Network parallel computing in java,” ACM Workshop on Java for High Performance Network Computing, 1998.

[128] “Genetic algorithm,” http://en.wikipedia.org/wiki/Genetic Algorithm Reproduc- tion, Online; accessed 21/2/2011.

[129] W. Hosch, “Genetic algorithm,” http://www.britannica.com/EBchecked/topic/752681/genetic- algorithm, Online; accessed 11/05/2011.

[130] B. Liu, R. Haftka, M. Akgun, and A. Todoroki, “Permutation genetic algorithm for stacking sequence design of composite laminates,” Comput Meth Appl Mechan Eng, vol. 186, pp. 357–372, 2000.

[131] D. Patterson, "Reduced instruction set computers, RISC," vol. 28, no. 1, Communications of the ACM, 1985.

[132] "TOP500 List - June 2011," 2011. [http://top500.org/list/2011/06/100, Online; accessed 7/11/2011].

[133] R. McIlroy, "Using program behaviour to exploit heterogeneous multi-core processors," PhD Thesis, Glasgow University, 2010.

[134] “Ibm xl c/c++ for multicore acceleration for linux on x86 systems, v9.0 delivers cell broadband engine architecture application development capability,” 2007.

[135] “Cell BE SDKs,” 2010. http://www.bsc.es/projects/deepcomputing/linuxoncell/, Online; accessed 19/12/2010.

[136] A. E. Eichenberger et al., "Optimizing compiler for the Cell processor," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT '05, pp. 161–172, IEEE Computer Society, 2005.

[137] T. Turner, Vector Pascal a Computer Programming Language for the Array Processor. PhD thesis, Iowa State University, USA, 1987.


[138] A. Formella et al., The SPARK 2.0 System - a Special Purpose Vector Processor with a VectorPASCAL Compiler. The Twenty-fifth Annual Hawaii International Conference on System Sciences, HICSS 25, 1992.

[139] P. Cockshott and G. Michaelso, “Orthogonal Parallel Processing in Vector Pascal,” J. Comput. Lang. Syst. Struct., 32,, pp. 2–41, 2006.

[140] P. Cockshott, Glasgow Pascal Compiler with Vector Extensions. Glasgow Univer- sity, 2011.

[141] I. Jackson, “Opteron Support for Vector Pascal,” Master’s thesis, Dept Computing Science, University of Glasgow, 2004.

[142] P. Cockshott, "Direct Compilation of High Level Languages for Multi-media Instruction Sets," 2000.

[143] A. Tanikawa, K. Yoshikawa, T. Okamoto, and K. Nitadori, "N-body simulation for self-gravitating collisional systems with a new SIMD instruction set extension to the x86 architecture," 2011.

[144] “X86 assembly/x86 assemblers,” http://en.wikibooks.org/wiki/X86-Assembly/x86- Assemblers, Online; accessed 22/8/2011.

[145] "Assembler directives," http://sources.redhat.com/binutils/docs-2.12/as.info/Pseudo-Ops.html, Online; accessed 19/10/2011.

[146] "Labels," http://sourceware.org/binutils/docs-2.17/as/Symbol-Names.html#Symbol-Names.

[147] C. Nelson, "Intel Core i7 "Nehalem" CPU Review," 2009. http://www.hardcoreware.net/intel-core-i7-nehalem-cpu-review/.

[148] J. Breitbart and C. Fohry, "OpenCL - An Effective Programming Model for Data Parallel Computations at the Cell BE," Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1–8, 2010.

[149] P. Clarke, “Parallel Programming Tool Offered for Cell Processor,” 2010.

[150] N. Tsingos, “Using programmable graphics hardware for auralization,” In Proc. EAA Symposium on Auralization, Espoo, Finland, 2009.

[151] K. Group, “OpenCL Introduction and Overview,” 2010. http://www.khronos.org/developers/library/overview/opencl-overview.pdf.


[152] NVIDIA, The CUDA Compiler Driver NVCC. NVIDIA, January 2008.

[153] Intel, “Intel threading building blocks,” Intel Corporation, http://www.intel.com, 2007.

[154] Y. Gdura and P. Cockshott, "A Virtual SIMD Machine Approach for Abstracting Heterogeneous Multicore Processors," Journal of Computing (GSTF), vol. 1, pp. 143–148, January 2012.

[155] M. Gschwind, "The Cell Broadband Engine: Exploiting multiple levels of parallelism in a chip multiprocessor," tech. rep., IBM Research Division, 2006.

[156] J. Abellan, J. Fernandez, and M. Acacio, “Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades,” Proceedings of 8th International Conference on Computation Science, pp. 456–465, 2008.

[157] A. Ershov, “On Programming of Arithmetic Operations,” pp. 3–6, 1958.

[158] P. Flajolet, J.-C. Raoult, and J. Vuillemin, "The number of registers required for evaluating arithmetic expressions," Theoretical Computer Science, 9, pp. 99–125, 1979.

[159] D. Warren and A. Center, “An abstract prolog instruction set,” tech. rep., SRI Inter- national, 1983.

[160] A. Nisbet, “Genetic algorithm optimized parallelization,” Workshop on Profile and Feedback Directed Compilation, 1998.

[161] N. Nwu and S. Li, "Instruction selection for ARM Thumb processors based on a genetic algorithm coupled with critical event tabu search," Chinese J. Computers, vol. 40, no. 4, pp. 680–685, 2007.

[162] P. Larranaga, C. Kuijpers, R. Murga, I. Inza, and S. Dizdarevic, "Genetic algorithms for the travelling salesman problem: A review of representations and operators," Artificial Intelligence Review 13 (2), pp. 129–170, 1999.

[163] D. Dasgupta and D. McGregor, “Nonstationary function optimization using the structured genetic algorithm,” Parallel Problem Solving from Nature, vol. 2, pp. 145–154, 1992.

[164] A. Aho, W. Kernighan, and P. Weinberger, The AWK Programming Language. Addison-Wesley, 1988.

[165] K. Sandeep, Practical Computing on the Cell Broadband Engine. Springer, 2009.


[166] D. Playne, M. Johnson, and K. Hawick, "Benchmarking GPU devices with N-body simulations," International Conference on Computer Design (CDES 09), 2009.

[167] "SICSA Multicore Challenge Phase II," http://www.macs.hw.ac.uk/sicsawiki/index.php/Challenge-PhaseII, 2011.

[168] “The Computer Language Benchmarks Game,” http://shootout.alioth.debian.org; accessed 11/11/2010.
