Keir, Paul G (2012) Design and Implementation of an Array Language for Computational Science on a Heterogeneous Multicore Architecture
Total Page:16
File Type:pdf, Size:1020Kb
Keir, Paul G (2012) Design and implementation of an array language for computational science on a heterogeneous multicore architecture. PhD thesis. http://theses.gla.ac.uk/3645/ Copyright and moral rights for this thesis are retained by the author A copy can be downloaded for personal non-commercial research or study, without prior permission or charge This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the Author The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the Author When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given Glasgow Theses Service http://theses.gla.ac.uk/ [email protected] Design and Implementation of an Array Language for Computational Science on a Heterogeneous Multicore Architecture Paul Keir Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy School of Computing Science College of Science and Engineering University of Glasgow July 8, 2012 c Paul Keir, 2012 Abstract The packing of multiple processor cores onto a single chip has become a mainstream solution to fundamental physical issues relating to the microscopic scales employed in the manufac- ture of semiconductor components. Multicore architectures provide lower clock speeds per core, while aggregate floating-point capability continues to increase. Heterogeneous multicore chips, such as the Cell Broadband Engine (CBE) and modern graphics chips, also address the related issue of an increasing mismatch between high pro- cessor speeds, and huge latency to main memory. Such chips tackle this memory wall by the provision of addressable caches; increased bandwidth to main memory; and fast thread con- text switching. An associated cost is often reduced functionality of the individual accelerator cores; and the increased complexity involved in their programming. This dissertation investigates the application of a programming language supporting the first- class use of arrays; and capable of automatically parallelising array expressions; to the het- erogeneous multicore domain of the CBE, as found in the Sony PlayStation R 3 (PS3). The language is a pre-existing and well-documented proper subset of Fortran, known as the ‘F’ programming language. A bespoke compiler, referred to as E], is developed to support this aim, and written in the Haskell programming language. The output of the compiler is in an extended C++ dialect known as Offload C++, which tar- gets the PS3. A significant feature of this language is its use of multiple, statically typed, address spaces. By focusing on generic, polymorphic interfaces for both the generated and hand constructed code, a number of interesting design patterns relating to the memory local- ity are introduced. A suite of medium-sized (100-700 lines), real-world benchmark programs are used to eval- uate the performance, correctness, and scalability of the compiler technology. Absolute speedup values, well in excess of one, are observed for all of the programs. The work ultimately demonstrates that an array language can significantly reduce the effort expended to utilise a parallel heterogeneous multicore architecture, while retaining high per- formance. A substantial, related advantage in using standard ‘F’ is that any Fortran compiler can create debuggable, and competitively performing serial programs. Acknowledgements I would firstly like to thank my wonderful wife, Steffi, not only for her moral support, pa- tience, and above all, love; but also for her practical advice, albeit through the refreshingly alternative perspective of a doctor of letters. Many thanks are also due to my supervisor, Dr. Paul Cockshott, who possesses an uncanny, raw intelligence, and knack for constructing the very question one might hope to avoid; to the benefit of all in earshot. Paul was exactly the supervisor I needed, and I greatly appreciate his kindness, candour, wisdom, and tolerance over these last years. I have seen far more of my second supervisor, Dr. John O’ Donnell, than is prescribed by reg- ulation, and I have benefited from our regular meetings more than he may know. Thankyou John for patiently introducing me to the mathematics behind programming languages. Thankyou to all the academic and administrative staff in the School of Computing Science. Dr. Alastair Donaldson, my first industrial supervisor, thankyou. You provided a solid op- erational structure; enthusiasm for the topic; and nothing less than a complete and thorough answer to any question I ever posed. Regrettably, time with my second industrial supervisor, Dr. Andrew Cook, was brief. Your assistance was nevertheless gratefully received. Andrew Richards, my third and final industrial supervisor, please accept my thanks also. Without your early decisions and integrity I would not be writing this. Thankyou Uwe Dolinsky, for your attention to detail, and diligent responses to my technical emails. To all those at Codeplay Software Ltd., many thanks for your assistance. To Prof. Lenore Mullin, thankyou kindly for being our SICSA distinguished visitor, and introducing the school to the mathematics of arrays. Your wisdom and ideas remain a source of inspiration long after your departure. To my dear colleague and companion, Youssef Gdura, you have reminded me of the essential quality of a good sense of humour. Your constant humility and equanimity, even in the face of a turbulent period for your homeland, remains a valuable lesson. Thanks also go to Mark i Shannon for being a friend on the road. Your certainty with regard to both technical and collegiate issues provided an essential ying to my doubting yang. Thankyou to Wikipedia R and the Wikimedia Foundation Inc. ii Contents 1 Introduction 10 1.1 Thesis Statement . 12 1.2 Contributions . 13 1.3 Publications . 14 1.4 Outline . 15 2 Related Work 17 2.1 Library Support for Parallelism . 18 2.1.1 The Cell SDK . 18 2.1.2 Threading Building Blocks . 20 2.2 Annotations, Directives and Pragmas . 20 2.2.1 High Performance Fortran . 21 2.2.2 OpenMP . 23 2.2.3 Cell Superscalar . 29 2.3 Language Extensions . 30 2.3.1 Sieve C++ . 31 2.3.2 OpenCL . 33 2.3.3 Cilk++ . 34 2.3.4 Deterministic Parallel Java . 34 2.4 Array Languages . 35 2.4.1 APL . 35 2.4.2 Fortran 90 . 37 2.4.3 ZPL . 37 2.4.4 Single Assignment C . 39 2.4.5 Vector Pascal . 40 2.5 Partitioned Global Address Space Languages . 42 2.5.1 Co-Array Fortran . 43 2.5.2 Unified Parallel C . 44 2.5.3 MPI Virtual Process Topologies . 47 iii 3 Methodology 48 3.1 The Cell Broadband Engine . 49 3.2 Offload C++ . 52 3.2.1 Launching Threads . 52 3.2.2 Pointer Locality . 53 3.2.3 Template Declarations . 56 3.2.4 Call Graph Duplication . 57 3.2.5 Translation Units . 59 3.2.6 Offload Block Parameters . 59 3.2.7 Explicit Inner Pointers . 62 3.3 Fortran and F . 63 3.3.1 Modules and Variables in ‘F’ . 63 3.3.2 Array Sections . 66 3.3.3 Scalar and Array Pointers . 67 3.3.4 The Do Construct . 68 3.3.5 Subroutines and Functions . 70 3.3.6 Derived Data Types . 71 3.3.7 Differences between ‘F’ and Fortran . 72 3.3.8 Differences between Fortran and ‘C’ . 73 3.4 Implicit Parallelism . 73 3.4.1 Scalar Expressions . 73 3.4.2 Elemental Functions and Array Expressions . 75 3.4.3 Array Assignment . 76 3.4.4 Parallel Execution . 77 4 Compiler Implementation 81 4.1 Overview of the E] toolchain . 82 4.2 Intermediate Representations . 84 4.2.1 The Parse Tree . 85 4.2.2 The Typed ‘F’ Abstract Syntax Tree . 89 4.2.3 The Offload C++ Abstract Syntax Tree . 90 4.3 Parsing with Monadic Combinators . 93 4.3.1 Predictive Parsing . 96 4.3.2 Combinators for Parsing ‘F’ . 100 4.4 Object Binding . 100 4.5 Translating into C++ . 103 4.5.1 Basic Types . 104 4.5.2 Pointers . 106 iv 4.5.3 Derived Types . 107 4.5.4 Functions and Subroutines . 111 4.5.5 Program Units . 112 4.6 Transforming Array Expressions . 113 4.6.1 Scrap Your Boilerplate . 114 4.6.2 Transforming the ‘F’ AST using SYB . 116 4.7 Optimisation for Constant Sizes . 122 5 Runtime Support and Static Pointer Locality 125 5.1 Scalar ‘F’ Types in E] ............................. 126 5.1.1 Representing ‘F’ Character Strings . 126 5.2 The E] Array Runtime Library . 127 5.2.1 Array Class Templates . 128 5.2.2 Array Declarations . 128 5.2.3 Implementing Essential ‘F’ Array Operations . 131 5.3 Interfacing with the ‘F’ Runtime Libraries . 135 5.3.1 Calculating the Result Type and Kind . 135 5.3.2 Application to Matrix Multiplication . 137 5.4 Manual Replication in Offload C++ . 139 5.4.1 Problem Description . 140 5.4.2 New with Locality . 144 5.4.3 Inner Pointers and Outer Duplication . 147 5.4.4 A Pointer Locality Class . 149 5.4.5 Closing Thoughts on Manual Replication . 153 5.5 Test Driven Development . 154 6 Experimental Results 156 6.1 Preliminary Setup and Experiments . 157 6.1.1 Benchmark Timing Routines . 157 6.1.2 SPU Memory Footprint Macro . 157 6.1.3 GNU Fortran Runtime SPU Memory Footprint . 158 6.1.4 New Operator SPU Memory Footprint . 159 6.1.5 Launch and Join Microbenchmark . 160 6.1.6 Benchmark Program Code Sizes . 160 6.1.7 Compilation Times . ..