
Generating Multithreaded Code from Parallel Haskell for Symmetric Multiprocessors by Alejandro Caro. S.B., Computer Science and Engineering, MIT 1990; S.M., Electrical Engineering and Computer Science, MIT 1993; Engineer in Computer Science, MIT 1998

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy

at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 1999

© 1999 Massachusetts Institute of Technology. All rights reserved.

Author ...... Department of Electrical Engineering and Computer Science, January 27, 1999

Certified by ...... Arvind, Johnson Professor of Computer Science, Thesis Supervisor


Accepted by ...... Arthur C. Smith, Chairman, Departmental Committee on Graduate Students

Generating Multithreaded Code from Parallel Haskell for Symmetric Multiprocessors by Alejandro Caro

Submitted to the Department of Electrical Engineering and Computer Science on January 27, 1999, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

This dissertation presents pHc, a new compiler for Parallel Haskell (pH) with complete support for the entire language. pH blends the powerful features of a high-level, higher-order language with implicitly parallel, non-strict semantics. The result is a language that is easy to use but tough to compile. The principal contribution of this work is a new code-generation algorithm that works directly on λ*, a terse intermediate representation based on the λ-calculus. All the power of the original language is preserved in λ*, and in addition, the new representation makes it easy to express the important concept of threads, groups of expressions that can execute sequentially, at a high level. Code generation proceeds in two steps. First, the compiler turns λ* programs into multithreaded code for the Shared-Memory Threaded (SMT) machine, an abstract symmetric multiprocessor (SMP). The SMT machine simplifies the code generator, by providing special mechanisms for parallelism, and improves its portability, by isolating it from any particular SMP. Unlike previous compilers for similar languages, the SMT code generated by pHc makes use of suspensive threads for better use of sequential processors. Second, the compiler transforms SMT instructions into executable code. In keeping with the goals of simplicity and portability, the final program is emitted as C code, with some extensions for compiler-controlled multithreading. This code is linked with a Run-Time System (RTS) that provides two facilities: a work-stealing scheduler and a memory manager with garbage collection. To evaluate the performance of the code, we generated instrumented programs. Using a small but varied set of benchmarks, we measured the components of run time, the instruction mix, the scalability of code up to eight processors, and, as a comparison, the single-processor performance against two of the latest Haskell compilers. We found encouraging speedups up to four processors, but rarely beyond. Moreover, we found substantial overhead in single-processor performance. These results are consistent with our emphasis on completeness and correctness first and on performance second. We conclude with some suggestions for future work that will remedy some of the shortcomings discovered in the evaluation.

Thesis Supervisor: Arvind
Title: Johnson Professor of Computer Science

Acknowledgments

I have been part of the Computation Structures Group for ten years, and in that decade, I have gone from queasy undergraduate to confident, some would say hardened, graduate. I cannot imagine time better spent. Through it all, Arvind has been the steady helm of CSG, and from his insights and his approach to hard questions, I have learned the course to good research. It follows a deceptively simple bearing: choose your work because you find it worthwhile, not because it is the fashion, and do it until you think it complete, not until others happen to judge it so. None of us can predict the long-term impact of our research, but work of high quality will not easily founder.

CSG originally attracted me because its dataflow projects involved a rethinking of computing from top to bottom. In Monsoon and StarT, we did it all, from theory to high-performance hardware design. But none of this would have been possible without a superb crew. Together we tackled problems that were big, tough, unyielding, and we persevered in the face of more obstacles than we care to remember. I was especially lucky to have worked closely with several generations, in particular Shail Aditya, Boon Ang, Andy Boughton, Derek Chiou, James Hoe, Jamie Hicks, Chris Joerg, Paul Johnson, Jan-Willem Maessen, and Ken Traub.

I was fortunate too in my choice of thesis committee. In Martin Rinard and Vivek Sarkar, I found not only two experts in parallel compilers, which is rare enough, but also two enthusiastic and approachable colleagues, which was a real treat. This dissertation was markedly improved by their keen observations.

Finally, as this long journey comes to an end, my thoughts turn to my family. Their unwavering faith in its successful completion stayed mine during some crushing ebbs. It is because of them that I never felt alone, and it is to them that I dedicate this work.

A mi familia, con mucho cariño.

Contents

1 Introduction 15

1.1 pH: Parallel Programming Made Easy 15

1.2 Complications ...... 16

1.3 Contributions ...... 17

1.4 Dissertation Overview ...... 18

2 Parallel Haskell 19

2.1 Origins and Design .. . 19

2.2 Type System ...... 22

2.3 Higher-Order Functions 24

2.4 Implicit Parallelism. . . 26

2.5 Lenient Evaluation . .. 27

2.6 Synchronizing Memory . 30

2.7 Barriers ...... 33

2.8 The λ*-Calculus ...... 35

2.9 Summary ...... 38

3 Compilation Challenges 41

3.1 Polymorphism ...... 41

3.2 Higher Order Functions ...... 44

3.3 Non-Strictness . .. .. 49

3.4 Synchronizing Memory . 55

3.5 Barriers ...... 57

3.6 Garbage Collection . .. 58

3.7 Summary ...... 59

4 The SMT Machine 61

4.1 Overview ...... 62

4.2 Arithmetic and Move: plus, etc., and move ...... 66
4.3 Control: jump and switch ...... 66
4.4 Scheduling: spawn and schedule ...... 67
4.5 Memory: allocate, load, touch, store, istore, and take ...... 69

4.6 Barrier Instructions: initbar, touchbar, incbar, and decbar ...... 72
4.7 Summary ...... 74

5 Compiling λ* to SMT 75

5.1 Environment ...... 75
5.2 Compiling Simple, Primitive, and Memory Expressions ...... 80
5.3 Compiling Case Expressions ...... 86
5.4 Compiling λ-Abstractions and Calls ...... 87
5.5 Compiling Letrec Expressions ...... 100
5.6 Compiling Programs ...... 109
5.7 Summary ...... 112

6 Better Threads 113

6.1 Suspensive versus Non-Suspensive Threads ...... 114

6.2 SMT Code from Threaded Programs ...... 116

6.3 Optimizations Enabled by Threading ...... 123
6.4 Partitioning ...... 130
6.5 Partitioning in pHc ...... 133
6.6 Partitioning fib ...... 135
6.7 Threading fib ...... 136
6.8 When Programs Become Too Serial ...... 141
6.9 Incremental Spawning ...... 143
6.10 Summary ...... 143

7 Executable Code Generation 145

7.1 Implementing SMT Processors ...... 146
7.2 Implementing the Global Work-Queue ...... 151

7.3 pH Data Representation ...... 155
7.4 Choices for Code Generation ...... 160
7.5 Avoiding Stack Allocation ...... 162
7.6 Implementing SMT Instructions ...... 165
7.7 Garbage Collection ...... 178
7.8 Machine Dependences and Porting ...... 180
7.9 Summary ...... 181

8 Preliminary Evaluation 183

8.1 Experimental Setup ...... 183

8.2 Components of Execution Time ...... 185

8.3 Instruction Mix ...... 187

8.4 Touch Behavior ...... 189

8.5 Speedup ...... 190

8.6 Single-Processor Performance vs. Haskell ...... 191
8.7 Summary ...... 195

9 Related Work 197

9.1 Id/pH Compilers ...... 197
9.2 Concurrent and Parallel Versions of Haskell ...... 200
9.3 Cilk Scheduling ...... 202

10 Conclusion 203

10.1 Contributions ...... 203

10.2 Future Work ...... 204

10.3 Conclusion ...... 207

A Semantics of λS 213

A.1 Syntax for λS ...... 214

A.2 Contexts in λS ...... 215

A.3 Equivalence Rules for λS ...... 216

A.4 Reduction Rules for λS ...... 217
A.5 An Efficient Reduction Strategy ...... 220

List of Figures

2-1 The Layered Design of pH ...... 20
2-2 Parallelism at All Levels in pH ...... 27
2-3 I- and M-Structures ...... 31
2-4 Barrier Synchronization ...... 35
2-5 Syntax of the λ*-Calculus ...... 36

3-1 Optimizing Higher-Order Functions via Inlining ...... 47
3-2 Examples of Lambda-Lifting ...... 49

4-1 The Shared-Memory Threaded (SMT) Machine ...... 62
4-2 SMT Instruction Set Summary ...... 64
4-3 RTL Notation for SMT Instruction Descriptions ...... 64

6-1 Original vs. Threaded Code for the Inner Letrec of fib ...... 122
6-2 Codeblocks: A Control-Flow Abstraction ...... 124
6-3 Codeblock for Threaded fib ...... 125
6-4 Algorithm to Find Defined Variables in Threads ...... 126
6-5 Algorithm to Assign Task Structures to Threads ...... 129
6-6 Threading Using the AlwaysAfter Map ...... 138

7-1 SMT Virtual Processor Implementation ...... 148
7-2 Work-Stealing Code in the Run-Time System ...... 153
7-3 Local and External Access to SVP Work Queues ...... 155
7-4 Traditional Tag Encoding ...... 156
7-5 NaN Tag Encoding ...... 158
7-6 Examples of Tagged Data ...... 159
7-7 Alternative Entry Points to pH Procedures ...... 163
7-8 pHob: The Implementation of pH Scalars ...... 166
7-9 UltraSPARC Definition of compareAndSwap64 ...... 180

8-1 Major Components of Run Time ...... 186
8-2 SMT Instruction Mix ...... 188
8-3 Percentage of touch Instructions that Suspend ...... 190
8-4 Speedup Curves for Fib and Queens ...... 192
8-5 Speedup Curves for Paraffins and MMT ...... 193
8-6 Speedup Curve for Gamteb ...... 194
8-7 Single-Processor Execution Times ...... 195

A-1 Reduction Rules for λS ...... 218

Chapter 1

Introduction

This dissertation presents the design, implementation, and evaluation of pHc, a new compiler for Parallel Haskell (pH). The pH language blends the powerful features of a high-level, higher-order language with implicitly parallel, non-strict semantics. The product is a language that is easy to use but tough to compile.

1.1 pH: Parallel Programming Made Easy

A major theme in the work of the Computation Structures Group at MIT has been the development of general-purpose implicitly-parallel programming languages like Id [41] and pH [42]. pH was designed with great care so that a compiler can automatically extract all available parallelism from a program without intervention from the programmer. Parallelizing a program by hand, especially one with complex structure and data dependences, is a notoriously difficult task, so this property of pH is particularly attractive.

In contrast, efficient parallel programming systems in wide use today, such as PVM or MPI, follow a different design philosophy. These systems give low-level control of parallelism and machine resources to the programmer. With enough time and patience, it is possible to write fast, correct programs using PVM or MPI. But these pale beside pH programs when compared on the basis of portability, maintainability, elegance, and ease of development. Languages like Cilk [12] occupy the space between the two extremes. On the one hand, the Cilk programmer is still responsible for parallelizing the program explicitly, and that can be a heavy task. But on the other, the type of parallelism offered by Cilk is restricted to make compilation and implementation straightforward.

The tension between performance and productivity is not new. Early parallel machines required ultra-efficient programs to improve their poor cost/performance ratios. But the advent of symmetric multiprocessors (SMPs) based on commodity CPUs has made parallel machines common, if not downright cheap. This change forces a rethinking of the goals of parallel programming. The need for performance at all costs will give way to an emphasis on programmability, as large numbers of programmers become interested in taking advantage of parallelism within a flexible programming environment. The skeptical reader should note that a similar shift in priorities-productivity before performance-has certainly taken place in the uniprocessor arena, as evidenced by the popularity of Java and Visual Basic.

1.2 Complications

The assumption underlying the design of pH is that programmers are more valuable than CPU cycles. pH allows programmers to write correct parallel programs quickly by placing the burden of parallelization squarely on the compiler. Unfortunately, current pH compilers cannot generate code that even comes close to the efficiency of "hand-coded" PVM, MPI, or Cilk programs, so the productivity gains offered by pH are often overwhelmed by the performance penalties.

The non-strict, implicitly parallel semantics of the language allow the compiler to extract all available parallelism from a program. This property is useful when programs are executed on machines like Monsoon [45], which have built-in support for fine-grain threads.

Standard processors, however, do not have the fast, low-latency synchronization and communication mechanisms to make them suitable for executing fine-grain parallel programs directly. Rather, the compilation strategy for pH on stock processors is tricky: pH programs must be chopped up carefully into sequential threads while preserving the correct semantics of the language. Synchronization and communication costs are better amortized as instructions are grouped into ever-larger threads.

In addition to improving parallelism, non-strictness also increases the expressive power of pH, but it introduces a host of problems for the compiler. Non-strictness forces many scheduling decisions to be performed dynamically at run-time, rather than statically at compile-time. Groups of instructions that could otherwise be combined into a single sequential thread often must be split into several threads to preserve the proper behavior of many programs. Compiler writers know problems are solved most efficiently at compile-time, but ignoring non-strictness would lead to incorrect execution of pH programs.

An additional headache for the pH compiler involves data structures. Data structures in pH are implemented using synchronizing memory in the form of I-Structures or M-Structures. I-Structures provide single-assignment semantics and M-Structures provide mutual-exclusion semantics at the granularity of a single memory word. Such fine-grain synchronization makes it easy to write highly unstructured programs exhibiting complex data sharing patterns, such as producer-consumer parallelism, that cannot be determined at compile-time. But this power comes at a heavy price since every heap access in pH must follow a precise protocol to maintain the synchronizing memory invariants. Machines like Monsoon have dedicated hardware for implementing synchronizing memory. Machines based on standard CPUs must implement the synchronization protocol completely in software.

Implicit parallelism, non-strict semantics, and synchronizing data structures are all features that make writing parallel programs in pH easy. But because all this drudgery of parallel programming is transferred to the compiler, the problem of generating good code for pH programs on stock machines is a difficult one.

1.3 Contributions

The work described in this dissertation has led to the following contributions:

" A compiler for pH with complete support for all language features.

" A new code-generation algorithm that works directly on A*, an intermediate representa-

tion based on the A-calculus. The multithreaded code generated by the compiler makes

use of suspensive threads, an approach that has been largely ignored since Traub's [69]

original work on the subject.

" In joint work with Jan-Willem Maessen, a complete Run-Time System for pH including

a work-stealing scheduler and garbage collector.

" A novel tagged data representation that supports the implementation of synchronizing

memory and garbage-collection with minimal impact on the range of scalars.

1.4 Dissertation Overview

The next two chapters comprise an introduction to pH. Chapter 2 shows how the different features of the language simplify the process of writing a parallel program. It concludes by introducing the λ*-calculus, a simplified, de-sugared version of pH used as the intermediate language of the compiler. Chapter 3 examines the complexities of pH compilation in detail.

Chapter 4 introduces the Shared-Memory Threaded (SMT) machine. SMT is an idealized symmetric multiprocessor model that insulates the code generator from the quirks of any particular real machine. Chapter 5 then describes the first part of the code generator: the compilation of a λ* program into SMT machine language. The compilation is syntax-directed and is presented using three recursive rules, TE, TS, and TP, for translating expressions, statements, and whole programs respectively.

The code generated in Chapter 5 is inefficient because it creates too many parallel tasks. Chapter 6 discusses partitioning and threading. The goal of partitioning analysis is to coarsen the granularity of parallelism in pH programs in order to reduce the costs of scheduling and synchronization. The threading transformation takes the results of partitioning to generate more efficient code.

Chapter 7 presents the translation of a program in SMT machine language into executable code. Our compiler uses the C language to emit low-level yet portable programs.

This chapter also describes the implementation of the pH Run-Time System.

Chapter 8 gives measurements of the performance of code generated by the pHc compiler.

Using a small but varied set of benchmarks, we discuss topics like the major components of run time, instruction mixes, and speedups.

Chapter 9 discusses the work of others that most closely influenced the design of pHc.

We also discuss alternative approaches to parallelizing the Haskell language.

Chapter 10 presents our conclusions and suggestions for future work. The current compiler encompasses a tremendous design and implementation effort, and it will serve as a sound framework for future pH research.

Chapter 2

Parallel Haskell

Parallel Haskell (pH) occupies a unique position in the space of programming languages, and its features and semantics are likely to be unfamiliar to many. We begin with an overview of pH and its principal features. Our intention is to give readers a feel for the language, not to provide a formal description. Those interested in more details can consult Nikhil and Arvind's book on pH [44]. We conclude with a description of the λ*-calculus, a formal system with the power of the full language but with a much more concise syntax. The λ*-calculus captures the essence of pH, and it is one of the cornerstones of our compilation strategy.

2.1 Origins and Design

pH traces its origins to Id [41] and Haskell [46]. Its semantics and evaluation model come from Id, but its syntax and type system are based on those of Haskell. The goal of pH is to serve as a bridge between these two language communities. Haskell programmers should be able to transition to parallel programming without sacrificing any of the high-level language features they depend on and without having to learn new syntax. In turn, Id researchers should be able to expose their work in parallel programming to a large audience of developers.

Id programs can be translated automatically into pH programs with no changes in their behavior. The translation is made possible by the shared semantics and by the fact that the pH type system is an extension of the simpler Id type system. Syntactically, all Haskell programs are legal pH programs. However, differences in the semantics of the two languages,

explained below, may lead some Haskell programs that rely on lazy evaluation to fail to produce an answer or to produce an error. A Haskell program will not terminate with an incorrect result when compiled as a pH program.

Figure 2-1: The Layered Design of pH. (Figure: the functional core, extended by the I-Structure layer, extended in turn by the M-Structure layer.)

pH was designed for general purpose parallel programming. The language offers a flexible multithreaded execution model that has been shown (in the context of Id) to be useful in writing scientific and symbolic applications [26, 60]. The features of pH are divided into three layers shown in Figure 2-1: the functional layer, the I-Structure layer, and the M-Structure layer. Each layer adds more expressive power to the language at the cost of more complex semantics. At its core, pH is a purely functional language that closely resembles Haskell. This layer of the language contains no operations for side-effects, and program execution is deterministic even on multiple processors. Programs written in this layer of the language are referentially transparent so sub-expressions with like values can be freely substituted without changing the value of the enclosing expression. Referential transparency makes functional languages amenable to optimization and automatic parallelization since the independence of expressions is easy to determine.

However attractive the pure semantics of functional languages, it is a fact of life that side-effects are useful in writing certain algorithms, if only to simplify the incremental computation of data structures. The next layer of pH extends the functional core with single-assignment data structures called I-Structures. With the addition of I-Structures, the language ceases to be referentially transparent. In particular, the operation of allocating an I-Structure is a side-effect. In this layer of the language, similar expressions cannot always be substituted. Such a substitution might cause allocations to be duplicated, and a shared I-Structure might be transformed incorrectly into two independent I-Structures. Nevertheless, programs written with I-Structures are guaranteed to be deterministic.

If a program terminates, then the answer it produces is guaranteed to be unique for the given set of inputs. If a program produces an error (possibly after it returns an answer), all execution paths are guaranteed to produce the same error, or more generally, the same set of errors. On this point, however, the choice is left up to the implementation: it may choose to wait for all the errors to occur, but more commonly, the program will be terminated on the first error.

The final extension to pH consists of M-Structures, imperative, synchronizing data structures. Programs that use M-Structures are not always deterministic, unlike programs that stay within the functional and I-Structure layers of the language. Non-determinism makes debugging extremely difficult because the same program with the same inputs can produce different results on different runs due to changes in the run-time scheduling of tasks.

It would seem that M-Structures are a poor addition to a language that is supposed to simplify parallel programming, but in fact, they are tremendously useful. Not surprisingly, M-Structures can be used to make programs more storage efficient because memory can be updated in place. What is a surprise, however, is that certain programs that use M-Structures can be more parallel and their specification more declarative than versions without M-Structures. Some systems programs, like memory managers, cannot even be written without M-Structures. In practice, M-Structures comprise a small fraction of the data structures in a program, so problems with non-determinism are often easy to isolate.

The original compilation targets for Id, the predecessor of pH, were the TTDA [8] and Monsoon [45] dataflow processors. These machines included hardware support for fine-grained parallelism down to the instruction level. Monsoon, for example, could synchronize two activities in a single cycle, and it could send a message on its network interface in a single clock cycle. To exploit this capability, Id was designed to expose all the parallelism available in a program. The only limits to parallel execution were true data dependences.

2.2 Type System

Haskell and pH are typeful¹ programming languages. Types are central to pH programming, and the programmer can use the capabilities of the type system to simplify the development process and to improve the robustness of his code. The Haskell type system that pH has inherited is an extended Hindley-Milner type system [17] with the following features:

Static Typing Programs are typechecked at compile-time. It is guaranteed that a type-correct program will not exhibit type errors at run-time.

Type Inference With some exceptions, the type system can infer the types of all expressions in a program automatically, without any declarations from the programmer. The exceptions occur due to the use of overloaded operators.

Polymorphism pH functions can be written to operate over arguments that can assume any type, and data structures can be instantiated to contain elements of any type. Polymorphism is particularly useful in combination with data structures like lists and arrays. We give examples below.

Overloading By extending the Hindley-Milner system with type classes, pH supports type checking of programs that use overloaded identifiers (or ad-hoc polymorphism). An overloaded identifier is one whose meaning is different according to context. Most languages limit overloading to identifiers like arithmetic operators (+, *, etc.), to represent both integer and floating point arithmetic with the same symbol. Type classes give the Haskell programmer the ability to create new overloaded identifiers (a small sketch follows this list).

Higher-Order Functions are first-class values. They can be created, passed as parameters, returned as results, and stored in data structures.
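To make the type class mechanism concrete, here is a small, self-contained sketch in Haskell syntax, which pH shares; the Sized class and its instances are our own illustration, not part of any pH library:

-- A hypothetical class of types whose values have a measurable size.
class Sized a where
    size :: a -> Int

instance Sized Bool where
    size _ = 1

instance Sized [a] where
    size = length

-- The overloaded identifier size resolves according to the argument type:
-- size True       ==>  1
-- size "hello"    ==>  5
-- size [1, 2, 3]  ==>  3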

The user can define new types in pH, as in the following example of binary trees:

data Tree a = Leaf a | Node (Tree a) (Tree a)

The algebraic type Tree has two data constructors, Leaf and Node. The data constructors are used to create new values of the type. The type Tree is particularly interesting because

¹ A term due to Cardelli [14].

it is recursive and polymorphic. It is recursive because the contents of a Node are themselves values of type Tree. It is polymorphic because it can be used to create trees that contain values of any type. The identifier Tree is itself a type constructor, parameterized by the type variable a. When applied to any type, as in Tree Int, the type constructor is used to create new types. In the case of Tree Int, the new type is a binary tree containing integers at the leaves. The expression Leaf 15 uses one of the data constructors to create a value of type Tree Int.
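As a small illustration of how recursion and polymorphism combine (our own example, not from the original text), a function that counts the leaves of a Tree works uniformly for trees of any element type:

leaves :: Tree a -> Int
leaves (Leaf _) = 1
leaves (Node l r) = leaves l + leaves r

-- leaves (Node (Leaf 15) (Leaf 20)) ==> 2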

The type system also allows us to define polymorphic functions.² Here is a trivial example:

identity :: a -> a
identity x = x

The first line of the example is an optional type signature for identity. A signature is not necessary to define the function because it is almost certain that the type inference algorithm can deduce the type automatically. Still, it is good programming style to provide one. The signature states that identity takes a single argument of any type, represented by the type variable a, and returns a result of that same type. Though omitted by convention in the type signature, the use of type variables implies universal quantification: ∀a, identity :: a -> a.

The second line is the code for the function. We see that identity simply returns its argument as the result. Applications of the function give the expected results (we use ==> informally to mean "evaluates to"):

identity 15      ==> 15
identity 2.0     ==> 2.0
identity "hello" ==> "hello"

Polymorphic functions derive their power from the fact that a single definition can be used for many different types, leading to very concise programs. To be precise, the type signature of a polymorphic function can be instantiated to many different types. The code generated for the body of the function can be shared by all instantiations. This power is difficult to appreciate in a function as simple as identity, but it is interesting to note that even such a simple function cannot be properly typed in a monomorphic language like C, without resorting to unsafe type casts. Monomorphic languages can assign only a single type to each variable or expression in a program.

² Unless otherwise noted, the terms function and procedure are used interchangeably, even for routines containing side-effects.

2.3 Higher-Order Functions

Another powerful feature of pH is its support for higher-order functions. Functions are first-class values in pH, so they can be created, passed as arguments to other functions, returned as results, and stored in data structures. Suppose we want to implement function composition: given two functions f and g, we want to create a third function that computes f(g(x)). The pH function compose does exactly that:

compose f g = \ x -> f (g x)

The compose function creates a new "anonymous" function using the '\' pH construct. The \ is a typographical approximation of λ-abstraction. The new function takes a single argument x, applies g to it and then applies f to the result.

What is the type signature of compose? It is:

compose :: (a -> b) -> (c -> a) -> (c -> b)

This type signature seems complicated, but it can be broken down into three parts. The compose function takes two arguments and returns one result. The type expression a -> b is the type signature of the first argument of compose, f. It represents the type of any function, since the type variables can be replaced with any actual type. The type expression for the second argument (g) is c -> a. It can also represent any function, but we have introduced a constraint by repeating the use of type variable a. We have stated that the result produced by g must be of the same type as the input to f. The final type expression describes the result of compose. It is a function whose input argument has the same type as the input to g and whose output has the same type as the result of f, exactly the purpose of function composition. A function like compose is higher-order because it takes functions as inputs or returns a function as a result. Moreover, compose is polymorphic because it can operate on an infinite number of input function types as long as the constraints expressed in the type signature are satisfied.
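For instance (an illustrative instantiation of our own, using functions that appear elsewhere in this chapter), composing sqrt with abs fixes all three type variables to the same numeric type:

sqrtAbs :: Double -> Double
sqrtAbs = compose sqrt abs    -- here a, b, and c all become Double

-- sqrtAbs (-9.0) ==> 3.0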

Though languages like C and Pascal allow functions (more accurately "function pointers") to be passed as arguments, stored in data structures, and returned as results, they do not have a construct equivalent to \ in pH. This means that new functions cannot be created at run-time, so these languages cannot be categorized properly as higher-order.

24 Another useful higher-order function is the map function on lists (analogous to mapcar in Lisp):

map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x:xs) = (f x) : (map f xs)

The type signature indicates that map takes a function as its first argument, a list as its second, and produces a list as its result. The type signature includes two constraints. The elements of the argument list must be of the same type as the input to the argument function f, and the elements of the result list must be of the same type as the result of f.

This definition of map is given in two lines and uses pattern-matching. The definition of the list type (not shown here) creates two constructors: [] (read as "nil") and ':' (used as infix and read as "cons"). If the input to map is the empty list, then map returns the empty list. If the input is a non-empty list, then map creates a new list cell whose head is the application of function f to the first element of the input list and whose tail is the recursive call of map on the tail of the input list. For example (note that [x,y,z] is shorthand for x:(y:(z:[]))):

map sqrt [1.0,4.0,9.0] ==> [1.0,2.0,3.0]

Like compose, map is both higher-order and polymorphic. Finally, pH also supports currying, a shorthand way to create new functions via partial application of existing ones. For example, double is a one-argument function that doubles its input. It is created from a two-argument multiplier function mult by currying mult over a single argument, 2:

mult :: Int -> Int -> Int
mult x y = x * y

double :: Int -> Int
double = mult 2

double 4             ==> 8
map double [1, 4, 9] ==> [2, 8, 18]

Languages like Lisp [64], Scheme [52], and ML [40] have long supported higher-order functions because they help programmers write concise, robust programs. Programs are concise because higher-order functions, like map, can be used to capture common "patterns" of computation. It's easy to specialize one of these generic patterns for a specific purpose: a subcomputation is simply passed in as a function argument. Programs become robust because they can be composed out of small, well-defined pieces. When using map, programmers need not worry about the details of iteration through lists. Abstractly, every element of the list argument will be processed. Instead, programmers can concentrate on the smaller problem of the operation to be applied to each element.

2.4 Implicit Parallelism

We consider pH an implicitly parallel language because its semantics allow a compiler to uncover all parallelism in a program automatically, without any help from the programmer. The only limits on parallel execution are data dependences and the behavior of three control constructs: conditionals, λ-abstractions, and barriers. If data dependences cannot be uncovered statically by analyzing the program, the language expects them to be detected at run-time, often through the use of synchronizing memory (Section 2.6).

The following pH function determines if a triangle is a right triangle given the lengths of its three sides. A note about syntax: pH supports the off-side rule so that indentation replaces braces ({ }) to delimit scopes.

rightTri :: Int -> Int -> Int -> Bool
rightTri a b c = let result = cc == aa + bb
                     aa = a * a
                     bb = b * b
                     cc = c * c
                 in result

The bindings in the let block do not execute in the usual top-to-bottom, left-to-right order of other languages. Rather, they execute in an order that respects data dependences. The binding for result will execute after the bindings for aa, bb, and cc. These three bindings, in turn, can execute in any order, in series or in parallel, since they are independent of each other.

Standard sequential programming is hard enough, and parallel programming adds a whole new dimension of complexity to the development process. In explicitly parallel programming languages, the programmer is responsible for identifying tasks that can execute concurrently, and often, is also responsible for scheduling tasks and for ensuring that shared data is accessed correctly. Implicit parallelism in pH vastly simplifies parallel programming because it places the burden of finding and exploiting parallelism squarely on

the compiler. Implicit parallelism also allows the programmer to write more reliable programs and more portable programs. Programs are more reliable because the compiler manages notoriously tricky tasks like synchronization and scheduling. These are often the sources of extremely complex, non-deterministic bugs in explicitly parallel programs. Programs are more portable because they are not written to a particular machine configuration or network topology. Rather, they are written for an abstract, shared-memory, rewriting-based execution model that can exploit parallelism at every level of a program, as shown in Figure 2-2.

Figure 2-2: Parallelism at All Levels in pH. (Figure: multiple active procedures, pointers to shared data structures, and concurrent loop activations.)

2.5 Lenient Evaluation

The semantics of pH are non-strict. This choice affects the language in three ways. First, the arguments of a procedure can be computed in parallel with the body of the procedure.

Second, the result of a procedure can be returned immediately after it has been computed, even before all computation has ceased in the body, or in extreme cases before all arguments have been computed. Third, the allocation of a data structure and the computation of its

contents are decoupled. This means that partially computed data structures can be passed around the program, which permits concurrent execution of both producers and consumers of data in a program. Non-strictness also makes it easy to create cyclic data structures.

The most common non-strict languages, like Haskell, follow a lazy evaluation strategy.

A lazy language guarantees that only those expressions required to produce the final result of a program are evaluated. In theory, the lazy evaluation strategy will find an answer in the least number of abstract computational steps. In practice, lazy languages make it convenient to write programs that use conceptually "infinite" data structures. Such programs terminate because they need only a finite portion of the infinite data structure to compute a result. The lazy language implementation guarantees that exactly that portion, not the entire structure, is computed during the execution of the program.
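A standard example of this style, written here in ordinary Haskell (pH, being eager, would diverge on it): the infinite list of natural numbers is defined by recursion, and take forces only the finite prefix that is actually needed.

nats :: [Integer]
nats = 0 : map (+1) nats    -- conceptually infinite; safe under laziness

firstFive :: [Integer]
firstFive = take 5 nats     -- ==> [0, 1, 2, 3, 4]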

pH is not a lazy language but instead follows an eager evaluation strategy. Every expression that can be evaluated in a pH program will be evaluated, even if it does not contribute to the final result. This means that some programs will execute forever in pH but will terminate in Haskell. For example:

forever :: Int -> Int
forever x = forever x

done :: Int -> String
done y = "done"

done (forever 5) ==> "done"  in Haskell
done (forever 5) ==> ???     in pH

The invocation of done terminates in Haskell because the result of the function, the string "done", does not depend on the argument to the function. The Haskell program never evaluates the infinite computation (forever 5). The pH program, on the other hand, will evaluate the argument, so it will execute forever. Depending on the details of the particular pH implementation, the program may still return the value "done". Why? Non-strictness says that functions can return results even before their arguments have been computed. If a pH implementation schedules tasks fairly, so that all are guaranteed to make progress, then the task that returns the result of the function will terminate eventually, even though the task that computes (forever 5) and, consequently, the whole program will not.

Except for control constructs, there is never a question as to whether or not an expression should be evaluated in pH. This property is particularly useful when reasoning about parallelism, but this is not the case for lazy languages. A central problem in compiling lazy programs is limiting evaluation to exactly those expressions that are needed to produce an answer. As a result, lazy programs tend to follow a highly serial execution path where only one sub-expression is enabled for evaluation at any point in time.

The combination of non-strict semantics and eager evaluation, referred to as lenient by Traub [69], is an intermediate step between lazy and strict languages in terms of expressiveness and efficiency. Lenient languages are more expressive than strict languages but do not require all the bookkeeping of lazy ones. Consider the following program in pH (adapted from Traub) to create a doubly-linked list from a normal list. The cells of the doubly-linked list are built from the Cell type. These are either an EndLink, to mark the beginning or the end of the doubly-linked list, or a Link with three components: the element of the cell, the previous cell, and the next cell.

data Cell a = EndLink | Link a (Cell a) (Cell a)

makeDoublyLinked :: [a] -> Cell a
makeDoublyLinked [] = EndLink
makeDoublyLinked (x:xs) =
    let dblLink [] prev = EndLink
        dblLink (l:ls) prev = let this = Link l prev next
                                  next = dblLink ls this
                              in this
        first = Link x EndLink rest
        rest = dblLink xs first
    in first

Non-strictness in pH allows the cyclic references in makeDoublyLinked between first and rest and between this and next in dblLink. Without this sort of reference, the doubly-linked list cannot be built, but this kind of data dependence cannot be expressed without side-effects in a strict language like ML.

The lenient semantics of pH also expose the maximum amount of parallelism in any program. Unlike a strict or lazy language (see Tremblay and Gao [71]), lenient evaluation does not hinder parallelism in pH, so the compiler is able to discover fine-grain parallelism directly, without the need for complex program analysis. The compiler does have to implement dynamic "safety-net" mechanisms for handling scheduling and synchronization that cannot be detected at compile-time, as we will see in Section 3.3 and Chapter 6. This is an important implementation challenge on non-dataflow machines.

2.6 Synchronizing Memory

One of the key elements of the pH programming model is a global heap for data structures shared among many tasks. One of the biggest challenges for parallel programmers has been to exploit the memory models offered by multiprocessor machines. Models that programmers find easy to understand, like sequential consistency [35], are not amenable to high-performance implementations. Models that can be implemented efficiently tend to provide weaker invariants than sequential consistency and often require significant effort to incorporate into programs.³ An effective memory model should be defined by the programming language, not the processor. This is the approach taken in pH with I-Structures and M-Structures, two forms of implicitly synchronizing memory.

2.6.1 I-Structures

I-Structures [9] are single-assignment data structures. During the lifetime of an I-Structure, every element can be defined at most once. As shown in Figure 2-3, a typical implementation divides I-Structure cells into two fields: data and presence. The data field holds the value of the I-Structure cell. On allocation, the contents of this field are undefined. The presence field is used to enforce the single-assignment semantics of the I-Structure cell, and it can take on one of two values, empty or full. On allocation, the presence field is initialized to empty to indicate that the contents of the data field have not been computed.

Two operations are defined on I-Structures, IStore and IFetch. These have the following type signatures, in pseudo-pH syntax:

IFetch :: Address -> Value
IStore :: Address -> Value -> Void

Both operations execute atomically with respect to the cell they access.

An IFetch operation on an empty I-Structure cell is said to defer. The task that issued the operation suspends and waits for the value of the cell to be computed. The task is placed on a deferred list associated with the cell. The deferred list identifies all tasks waiting for the value of the cell to be computed. In contrast, an IFetch operation on a full cell does not defer. It retrieves the contents of the data field and leaves both the presence and data fields unchanged.

³ See Frigo [19] for an interesting discussion of this issue as well as the weakest reasonable memory model.

Operations:

  IFetch 0       =>  17.0
  IStore 1 2.78  =>  side-effect only
  IFetch 2       =>  task1 deferred
  MFetch 3       =>  8.69
  MStore 4 3.14  =>  side-effect + task2 enabled

Memory State Before Operations      Memory State After Operations

  0  full   17.0                      0  full   17.0
  1  empty  -                         1  full   2.78
  2  empty  -                         2  empty  -  (task1 deferred)
  3  full   8.69                      3  empty  -
  4  empty  -  (task2 deferred)       4  empty  -

Figure 2-3: I- and M-Structures

An IStore operation on an empty I-Structure cell has two effects: the presence field changes to full and the contents of the data field are replaced with the Value parameter of the IStore. If the cell has a deferred list, those tasks are provided the new value of the cell and are restarted. An IStore operation on a full cell generates a "double store" error since it violates the single-assignment invariant of I-Structures.

The I-Structure protocol implemented by the IFetch and IStore instructions provides implicit, fine-grain synchronization for parallel tasks that share data structures. The pH programmer never needs to worry about coherence, atomicity, or any of the other complications associated with concurrent access to shared memory. The protocol specifies the behavior of each I-Structure cell independently and unambiguously. Furthermore, the single-assignment semantics of I-Structures guarantee that pH programs remain deterministic.

2.6.2 M-Structures

While I-Structures provide a great deal of functionality and account for the majority of data structures in pH programs, some applications require truly mutable state. The language provides M-Structures [10] for this purpose. The value of an M-Structure cell can change many times during its lifetime, but the M-Structure protocol guarantees mutual exclusion between tasks accessing a particular cell. The protocol implicitly synchronizes any number of tasks accessing an M-Structure cell to prevent data races.

Freshly allocated M-Structures look a lot like I-Structures. The presence field of each cell is set to empty and the data field is undefined. Two basic operations are defined for M-Structures:

MFetch :: Address -> Value
MStore :: Address -> Value -> Void

An MFetch operation on an empty M-Structure cell defers, just like an IFetch on an empty I-Structure. An MFetch on a full cell, however, changes the presence field to empty and returns the value of the data field. In this way, it forces future MFetch operations to wait for a new value of the cell to be written back.

An MStore to an empty cell first checks for the existence of a deferred list. If there is no deferred list, the MStore changes the presence field to full and updates the data field with the new value. If a defer list does exist, the MStore extracts only one task from the list, provides it with the new value, and resumes it. All other tasks remain deferred, and the presence and data fields of the cell remain unchanged. In this way, the MStore guarantees that only one task at a time controls the value of the M-Structure cell. Notice that we leave unspecified exactly which task gets resumed on an MStore. It is up to the implementation to make this choice. A good implementation should strive for fairness, so that no particular reader is ever deferred indefinitely. Not surprisingly, an MStore to a full cell generates an error. It violates the mutual exclusion invariant that pairs each MFetch with an MStore.

It is important to note that the data structures in the core functional layer of pH are built from I-Structures and M-Structures. The I-Structure and M-Structure extension layers simply expose the mutable interfaces to data structures, while the core layer keeps them carefully hidden under a functional façade.
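For readers who know concurrent Haskell, a rough analogy (ours; the correspondence is loose, not the pH implementation) can be drawn to MVar: readMVar behaves like IFetch (block until full, leave the value in place), while takeMVar and putMVar behave like MFetch and MStore (empty the cell on read, refill it on write). A minimal sketch:

import Control.Concurrent

main :: IO ()
main = do
    cell <- newEmptyMVar                      -- freshly allocated cell: empty
    _ <- forkIO (putMVar cell (42 :: Int))    -- producer, like an MStore
    v <- takeMVar cell                        -- consumer, like an MFetch:
    print v                                   --   defers until the cell is full

The analogy breaks down at the error cases: a second putMVar on a full MVar blocks rather than signaling the "double store" error that an IStore to a full I-Structure cell would raise, and nothing enforces the write-once discipline of I-Structures.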

2.6.3 Producer-Consumer Parallelism

The non-strictness and implicit synchronization of I-Structures and M-Structures allow pH programs to exploit producer-consumer parallelism trivially. Suppose you are formatting your dissertation using the following pH functions:

latex :: File -> DVItree
dvips :: DVItree -> ()

mkDoc :: File -> ()
mkDoc f = dvips (latex f)

The first pass converts a LaTeX input file into a device-independent (DVI) page description.

The second pass generates PostScript output from the DVI description and prints the results.

Normally, the downstream pass, dvips, waits for its preceding pass to produce all its output before executing. But in pH, a pass can execute immediately as parts of the input data structure become available. Even a complex data structure like a DVItree can be produced and consumed in parallel. It is built out of basic I-Structure and M-Structure building blocks, so pH guarantees that parallel accesses to it will be synchronized. If the downstream pass gets ahead of the upstream pass, it will defer until the required data is available.

2.7 Barriers

The execution model of pH is parallel by default. However, it is often useful to sequentialize computations without introducing data dependences. pH provides barriers with the seq construct for this purpose. Conceptually, a barrier divides a computation into two regions, the pre-region before the barrier and the post-region after the barrier. The purpose of the barrier is to prevent execution from proceeding to the post-region until the pre-region terminates.

In most programming languages, barriers have global effect. Their scope extends to all parallel activities (threads, tasks, processes) in a program. Barriers are usually used for correctness, to divide programs into separate phases with well defined invariants, but they can also be used to improve performance as shown by Kuszmaul [34]. Barriers in pH are different because their scope can be controlled precisely to extend only to as many tasks as necessary. Suppose you want to print the coordinates of a point using the format "(x,y)" (below, x evaluates to 50 and y to 100). The values of x and y can be computed independently, so you write the following code:

let x = a big computation
    print "("
    print (asString x)
    print ","
    print (asString y)
    print ")"
    y = another big computation
in ...

In pH each statement in this let (letrec) block can execute in parallel if possible. Even if the computation of x terminates before the computation of y, the program might generate the output "),(10050", hardly the desired result. The real problem is that the print operation only performs a side-effect and does not return a value. Thus, data dependences cannot be used to sequentialize a series of print calls. The solution is to use seq:

let x = a big computation
    seq
      print "("
      print (asString x)
      print ","
      print (asString y)
      print ")"
    y = another big computation
in ...

The seq block forces the five statements within it to execute in serial order, but only those statements are affected. The entire seq block as a unit can execute in parallel with the computations of x and y. For instance, if x terminates before y, the printing of x can execute concurrently with the computation of y. The seq does not alter the lexical level of the statements it encompasses. These remain at the same lexical level as x and y.

A barrier provides a powerful way to express bulk synchronization. All computation in the pre-region will have terminated before any computation in the post-region executes.

In Figure 2-4, we concentrate on a single barrier region. The pre-region consists of the statements S1 ... S3 while the post-region corresponds to statements S4 and S5. Statements S1 ... S3 can execute in parallel, but due to the barrier, the compiler will emit code that will wait for all the tasks they create to terminate before it schedules S4 and S5. The barrier guarantee holds regardless of the complexity of the computation in the pre-region. For instance, S1, S2, and S3 might each invoke a different recursive function, which might lead to a deep call tree and lots of parallelism. The barrier synchronization applies to tasks created in function calls too, and they must all terminate before the post-region executes. This is exactly the same kind of guarantee provided by the sync statement in Cilk.

Figure 2-4: Barrier Synchronization. (Figure: the pre-region, statements S1-S3 executing in parallel, can be arbitrarily complex; the barrier prevents the post-region, S4 and S5, from executing until the pre-region terminates. Code: let seq (par S1 S2 S3) (par S4 S5).)

Barriers are rare in most pH code. Their presence is closely tied to the use of M-Structures or I/O. However, as we'll see in Chapter 5, the implementation of barrier synchronization has a pervasive influence on the code generation strategy for pH.

2.8 The λ*-Calculus

The previous sections have touched on the major features of pH. The full language combines a complex semantics with a rich syntax, making compilation a real challenge. To simplify the problem, we use the λ*-calculus, a syntactically simple language that captures the essence of pH. The compiler is designed so that only the parser and type-checking phases process the full source language. These are followed by a "desugaring" phase that converts a pH program into an equivalent λ* program. None of the power of pH is lost in the conversion, but the more restricted syntax of λ* vastly simplifies the structure of the rest of the compiler.

Expressions

E ::= x                                 Identifier
   |  λx1 ... xk.E                      λ-abstraction
   |  x y1 ... yk                       Application
   |  {S in x}                          "Letrec" block
   |  Case(x, E1, ..., Ek)              Integer conditional
   |  PFk(x1, ..., xk)                  k-ary primitive
   |  Number | Boolean | Address ...    Constant
   |  CNk(x1, ..., xk)                  Constructor application
   |  allocate(x)                       Mutable allocation
   |  iFetch(x) | mFetch(x)             Load from synchronizing memory
   |  ⊤                                 Error value (top)

Statements

S ::= ε                                 Empty binding
   |  x = E                             Identifier binding
   |  S ; S                             Parallel statements
   |  S --- S                           Sequential statements (barrier)
   |  S ~ S                             Threaded statements
   |  iStore(x, y)                      I-structure store
   |  mStore(x, y)                      M-structure store
   |  ⊤s                                Error binding

Primitive Functions and Constructors

PF1 ::= not | Prj0 | Prj1 | Prj2 | ...  Unary primitive
PF2 ::= + | - | * | ...                 Binary primitive
CN2 ::= Cons | Pair | ...               Data constructor

Figure 2-5: Syntax of the λ*-Calculus

The λ*-calculus is a minor modification of the λS-calculus used by Arvind et al. [7] to give a precise semantics for pH. It is based on the call-by-need λ-calculus and includes all the unique features of pH: parallel execution, I- and M-Structures, and barriers.

Figure 2-5 presents the syntax of the λ*-calculus.

Notice that instead of the off-side rule, λ* requires explicit braces to delimit scopes, and separators, like ';', '---', and '~', to separate statements. The syntax of the calculus also restricts where full-fledged expressions may appear: only as the bodies of abstractions, in the arms of Case expressions, and in the right-hand sides of bindings. Everywhere else, only identifier expressions can appear. This restriction does not curtail the power of the language, and it helps to simplify the compilation process. Finally, we assume that parentheses can be used to group non-terminals like S and E.
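One way to picture λ* concretely is as an algebraic datatype; the following Haskell sketch (our own encoding, mirroring Figure 2-5, not the actual definition used inside pHc) shows how directly the grammar can be transcribed:

type Ident = String

data Constant = NumC Double | BoolC Bool | AddrC Int

data Exp                      -- expressions E
    = Var Ident               -- identifier
    | Abs [Ident] Exp         -- multi-argument λ-abstraction
    | App Ident [Ident]       -- application x y1 ... yk
    | Letrec [Stmt] Ident     -- { S in x }
    | Case Ident [Exp]        -- integer conditional
    | Prim Ident [Ident]      -- k-ary primitive PFk
    | Con Ident [Ident]       -- constructor application CNk
    | Const Constant          -- number, boolean, address, ...
    | Allocate Ident          -- mutable allocation
    | IFetch Ident            -- load from an I-structure
    | MFetch Ident            -- load from an M-structure
    | Top                     -- error value (top)

data Stmt                     -- statements S
    = Empty                   -- empty binding
    | Bind Ident Exp          -- x = E
    | Par Stmt Stmt           -- S ; S    parallel statements
    | Barrier Stmt Stmt       -- S --- S  sequential statements (barrier)
    | Thread Stmt Stmt        -- S ~ S    threaded statements
    | IStore Ident Ident      -- iStore(x, y)
    | MStore Ident Ident      -- mStore(x, y)
    | ErrS                    -- error binding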

As an example, the pH expression in the previous section to print the coordinates of a point is written in λ* as follows:

{ x = a big computation
  t1 = "(" ---
  _ = print t1 ---
  t2 = asString x ---
  _ = print t2 ---
  t3 = "," ---
  _ = print t3 ---
  t4 = asString y ---
  _ = print t4 ---
  t1 = ")" ---
  _ = print t1
  y = another big computation
in ... }

The λ* version of the expression includes additional variables (t1-t4) to name intermediate computations. Also, it uses the syntax '_ =' for binding statements where the result of an expression need not be bound to an identifier. These statements usually result from expressions executed only for side-effects.

The λ*-calculus makes two extensions to the original λS. First, λ-abstractions and application expressions can take multiple arguments. The original calculus only supported single-argument functions, but multiple arguments are needed to enable optimizations of currying. The second extension is the '~' syntax to group statements. The '~' indicates that the second statement (rightmost in Figure 2-5) does not need to execute in parallel with the first. In fact, the '~' notation asserts that the second statement can always execute after the first, regardless of how the procedure is called. The '~' statement conjunctions are inserted during the partitioning and threading phases of code generation. We discuss these in detail in Chapter 6.

Neither of these two changes affects the semantics of the original λS calculus: multiple-argument functions behave exactly as nested single-argument functions, and the '~' notation can be treated just like ';'. Beyond these, the differences between the two calculi are trivial. We give a summary of the semantics of λS in Appendix A.

The use of λ* as an intermediate program representation is unique to our pH compiler. Previous pH (and Id) compilers use dataflow program graphs (DFPGs) [68] rather than λ-calculus expressions as the intermediate representation. This is a legacy of the time when Id and pH were compiled for dataflow machines like TTDA and Monsoon. We had three motivations for breaking the dependence of the compiler on dataflow graphs.

First, we wanted the intermediate form of the compiler to follow λS, the framework used to give semantics to pH. Understanding of pH has evolved significantly since the days when its behavior was directly linked to operations on dataflow graphs. Today, with the λS formalism, the semantics of pH are still operational, but they are more abstract and much more precise. Indeed, important, though subtle, problems with the semantics of barriers were fixed only when λS was adopted. A close correspondence between the theoretical λS work and the language implementation gives us a better basis for writing a correct compiler.

Second, by using a representation closer to the λ-calculus, we can leverage and contribute to the work of the higher-order, typed language community. The current front-end of the pHc compiler, for instance, was contributed by Lennart Augustsson and can do double duty as a Haskell front-end. Many of the high-level optimization passes applicable to Haskell, like deforestation [38], are also applicable to pH because the two languages share a common functional core.

Third, the λ* representation seems well-suited to expressing threaded computations. Our compiler emits code for SMPs composed of standard von Neumann processors, so the quality of the code depends on how well we translate implicitly parallel pH programs into sequential code. Past experience has shown that expressing control flow in dataflow graphs is cumbersome. In contrast, we found that extending λS slightly with the '>>' construct made it easy to think about threading at a high level.

2.9 Summary

The variety of parallel language implementations is surprisingly rich,4 yet the combination of features in pH is unique. This is perhaps because the designers of the language made few compromises in their quest to make it easy to write correct parallel programs. For the foundation of the language, they chose to emulate existing strongly-typed, higher-order functional languages rather than imperative languages. A flexible type system makes it possible to write concise, well-structured programs, while static typing provides a strong guarantee about the correct run-time behavior of programs.

4. See Rinard [53, Ch. 5] for a general overview of the subject or Hammond [23] for one focused on parallel functional languages.

In the area of semantics, the decision to make the language non-strict, eager, and implicitly parallel places the burden of finding parallelism on the compiler, not the programmer. This relieves a significant amount of the workload involved in writing a correct parallel program. Unlike other languages that emphasize a particular kind of parallelism (e.g., data-parallelism), the semantics of pH also help make the language useful for general-purpose programming because they don't obscure any of the parallelism available in an algorithm, at any granularity.

One of the biggest headaches in parallel programming is coordinating tasks that access data structures in shared memory. Functional languages, like the core of pH, are well-suited for parallel programming precisely because they avoid side-effects. But side-effects can improve the efficiency and the ease of writing certain kinds of algorithms, so pH includes write-once I-Structures and mutable M-Structures. The key feature of I-Structures and M-Structures is that the memory model is defined by the language, not by the programmer nor by the machine. This substrate of memory with automatic synchronization at the granularity of a single memory word is a great aid in the development of programs with shared data structures. To simplify programming with side-effects even more, the language also includes fine-grain barriers that allow precisely-defined regions of a program to be serialized completely.

pH was designed to make parallel programming as natural and easy as sequential programming. The language relieves the programmer of tasks that are tedious or error-prone and makes them the responsibility of the compiler. Throughout the rest of this dissertation, we see the implications of these design decisions on the process of generating efficient code for pH source programs.

Chapter 3

Compilation Challenges

Writing parallel programs in pH is easy. In fact, most novice programmers do not realize they are writing parallel programs because pH is parallel "by default." It requires explicit syntax to make a program sequential. In the last chapter, we showed how many features of the language help the programmer write concise, robust, and highly parallel programs, but unfortunately, simplicity for the programmer does not come for free. In the design of pH, the cost of a simple programming model is borne by the compiler writer: the pH compiler must shoulder not just the usual burdens of generating code from a high-level language but also the task of exploiting implicit parallelism.

In this chapter, we discuss how specific features of pH make it difficult to generate efficient code. The point we wish to make is simple: to achieve parity in code efficiency with a low-level imperative language like C or its parallel dialects like Cilk, the pH compiler writer must tackle a wide range of problems absent in those less complex languages. Some of these problems have to do with the expressiveness of the language and some with parallelism.

Good work exists in the literature that addresses several of the problems discussed in this chapter, but incorporating all of it into a single compiler is a vast undertaking.

3.1 Polymorphism

We have already seen how the polymorphic type system of the language enables the programmer to write "generic" routines that operate over many different types. In general, the rich type system of pH is an asset to both the programmer and the compiler writer. For instance, thanks to compile-time type checking, code to detect type errors at run-time does not have to be generated, as it must be in dynamically-typed languages like Lisp.

So strong typing can improve the efficiency of the generated code. Polymorphism, on the other hand, tends to introduce inefficiencies by requiring complete uniformity in the calling conventions and in the data representation for the language.

Recall the identity function from the previous chapter:

    identity x = x

We assume that arguments are passed to and returned from functions in registers. In the case of identity, which register should we use? The code for identity must work properly for arguments of all types: integers, floating-point numbers, pointers, etc. In general, values of these types cannot be stored in the same registers for two reasons. First, different data types have different bit-widths. Integers might be 32 bits wide while IEEE double-precision floating-point values are 64 bits wide. Second, functional units in modern processors operate on different register sets. The integer ALUs might operate on the 32-bit integer register set and the floating-point ALUs might operate on the 64-bit floating-point register set.

Transferring values between register sets often requires a memory store and load.

The standard calling conventions for operating systems like IBM AIX [27], which have been tailored to monomorphic languages like C, often require that integers be passed in integer registers and floats in floating-point registers. In a language like C, the prototype (type signature) of a function is enough to determine the correspondence of arguments to registers, but of course, this is not the case in pH.

The problems of polymorphism are not limited to the function calling conventions. The format of aggregate data structures is also affected. Consider another trivial polymorphic function:

    getArrayVal :: Array Int e -> Int -> e
    getArrayVal a i = a!i

This routine simply fetches a value from an array a given the index i. The '!' is the infix array-indexing operator in pH. Due to the type variable e in the type signature, getArrayVal is a polymorphic function that works for integer-indexed arrays containing elements of any type. But if different data types require different amounts of memory, how do we calculate the correct memory offset within the array for the element we require?
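To make the offset problem concrete, the following C fragment (a hypothetical sketch, not pHc's generated code) shows the computation a compiler must emit when the element size is not known until run-time:

    #include <stddef.h>

    /* Hypothetical sketch: a polymorphic array access cannot compute the
     * element offset at compile-time, so the element size must be passed
     * (or looked up) at run-time. */
    void *get_array_val(void *base, int i, size_t elem_size)
    {
        return (char *)base + (size_t)i * elem_size;
    }

A monomorphic language can fold elem_size into a constant at every call site; a polymorphic one, in general, cannot.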

The simple solution to the problems presented by polymorphism is to use a uniform-width data representation. Many compilers for polymorphic languages choose to represent all data as fixed 32-bit quantities because these can be handled conveniently in the integer registers of a processor. Data types larger than 32 bits are allocated on the heap ("boxed") and are represented by 32-bit heap pointers. Such a uniform-width data representation incurs extra overhead in both space and time. Obviously, boxed values consume an extra memory location per floating-point number, but even worse is the fact that an array of eight-bit characters consumes 32 bits per element. In addition, extra instructions must be included to box and unbox floating-point values every time they are used in a non-polymorphic way.
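The following C sketch illustrates the boxing overhead under such a 32-bit uniform representation; the names and layout are illustrative, and the code assumes pointers fit in 32 bits:

    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t Word;            /* every value is one 32-bit word */

    /* Boxing a double costs a heap allocation and a store... */
    static Word box_double(double d)
    {
        double *box = malloc(sizeof *box);
        *box = d;
        return (Word)(uintptr_t)box;  /* assumes 32-bit pointers */
    }

    /* ...and every non-polymorphic use costs a load to unbox. */
    static double unbox_double(Word w)
    {
        return *(double *)(uintptr_t)w;
    }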

3.1.1 Compiling Polymorphism Efficiently

There is a body of interesting work that aims to improve the efficiency of polymorphic languages. Leroy [36] presents a scheme to mix both specialized (such as 64 bits for double-precision floats) and uniform data representations in a strict polymorphic language. In his system, functions known to be monomorphic at compile-time always work with specialized data representations. Polymorphic functions, on the other hand, use a uniform data representation. The problem is reduced to determining when to insert explicit boxing and unboxing code to communicate values between the monomorphic and polymorphic portions of a program. Leroy shows a systematic way to perform these conversions. Henglein and Jorgensen [25] generalize Leroy's work by providing a system to derive optimal solutions to the problem. On their sample programs, their system generally performs fewer boxing/unboxing operations than Leroy's.

There are substantial differences between the approaches taken in strict and in non-strict (lazy) functional languages. Peyton Jones and Launchbury [50] extended the Core language (the λ-calculus intermediate representation) of the Glasgow Haskell Compiler to support unboxed values directly. This allows them to describe unboxing optimizations in a well-specified form, not as some ad-hoc pass hidden in the code-generator. On top of this infrastructure, Hall et al. [22] describe a system that allows the programmer to specify, perhaps based on profiling information, the use of unboxed types in certain data structures.

Since unboxed types cannot be used with polymorphic functions, their scheme takes care of generating specialized versions of polymorphic functions for each unboxed type declared.

This relieves a lot of tedium for the programmer, but is clearly not a completely automatic process.

Why does the work on lazy languages not follow the path of automatic unboxing taken for strict languages? As discussed in [22], boxed and unboxed values do not have the same semantics. In particular, unboxed values are strict and can only be used where they will not change the strictness properties of the program. This requires that boxing/unboxing and strictness analysis (see Section 3.3.1) be combined into the same optimization. Hall et al. are dubious of the efficacy of strictness analysis in this context, and so they leave it to the programmer to specify which values should go unboxed.

3.1.2 Approach Taken in pHc

In the implementation of pHc, we chose the expeditious solution of a uniform-width data representation. However, to avoid the overhead of boxing and unboxing, we chose to make all scalars 64 bits wide. Section 7.3 describes our data representation in detail. We pay overhead in terms of increased memory usage and bandwidth, but we achieve a simple implementation. In addition, the uniform-width representation fits well with garbage collection. In the future, we expect pHc to include some of the advanced techniques discussed above, though these will require changes in the middle-end of the compiler.

3.2 Higher-Order Functions

Higher-order functions, in conjunction with polymorphism, are essential to writing compact, reusable code. In functions such as map, higher-order functions help programmers abstract control flow and concentrate on per-element operations rather than on whole data-structure traversals. Compiler writers, on the other hand, face several problems when generating code for higher-order functions. First, the procedure calling conventions become significantly more complicated because even simple applications can be ambiguous at compile-time.

For example, consider the use of parameter g in the following function, and assume the application of g returns an integer (pH type Int):

    λ g . g 1 2 3

To compile an application, we need to know the arity of the function. What is the arity of function g? The answer seems obvious. Since three integer arguments are being passed to g, then g must have arity three. Unfortunately, this reasoning is incorrect. It may come as a shock to discover that type information in a higher-order language does not provide precise information about the arity of a function. The application of g above could in fact be satisfied by four kinds of functions with very different types:

1. g1 :: Int -> Int -> Int -> Int
   A function of arity three that returns an integer.

2. g2 :: Int -> Int -> (Int -> Int)
   A function of arity two that returns a function of arity one that returns an integer.

3. g3 :: Int -> (Int -> Int -> Int)
   A function of arity one that returns a function of arity two that returns an integer.

4. g4 :: Int -> (Int -> (Int -> Int))
   A function of arity one that returns a function of arity one that returns a function of arity one that returns an integer.

For the purposes of type-checking, these four functions are equally suited to making the application a type-correct expression. For the purposes of code generation, however, these functions do not have the same behavior. An application of g1 to three arguments consists of only a single function call, while an application of g4 to three arguments consists of a sequence of three separate function calls. The end result of both computations is an integer, but the path to that result is different in each case. How can the compiler determine at compile-time what kind of function is being passed as parameter g? The answer is, in general, that it cannot. Instead, the compiler must generate code to determine the arity and entry-point of a function at run-time. It does so by storing this information in the closure data structure representing the function. The extra instructions needed to interrogate the closure at run-time and to perform the correct action depending on the closure's remaining arity are an important part of the overhead of higher-order functions. In Section 5.4.2, we show exactly how we handle these sorts of applications in pHc.
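As a sketch of what such generated dispatch code might look like, consider the following C fragment; the closure layout and the apply loop are illustrative, not the actual pHc scheme (which Section 5.4.2 presents):

    #include <stdint.h>

    typedef intptr_t Val;

    /* Each closure records how many arguments its code still expects. */
    typedef struct Closure {
        int  arity;
        Val (*code)(struct Closure *self, Val *args);
    } Closure;

    /* Generic application of f to nargs arguments, e.g. "g 1 2 3". */
    Val apply(Closure *f, Val *args, int nargs)
    {
        for (;;) {
            if (nargs == f->arity)        /* exact arity: one direct call */
                return f->code(f, args);
            if (nargs > f->arity) {       /* oversaturated: call, then apply
                                             the returned closure to the
                                             leftover arguments */
                Val r = f->code(f, args);
                args  += f->arity;
                nargs -= f->arity;
                f = (Closure *)r;
                continue;
            }
            /* undersaturated: build a partial application (omitted) */
            return 0;
        }
    }

An application of g1 takes the exact-arity branch once; an application of g4 goes around the loop three times, which is precisely the overhead the text describes.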

The second and perhaps more challenging aspect of compiling higher-order functions is developing a good representation for closures. The purpose of a closure in a lexically-scoped, higher-order language is to capture the values of the free variables of a function when it is created. When the function is invoked, the closure is used to establish the correct environment for the body of the function to execute. A closure is a conduit for transferring data from the definition to the uses of a function. An efficient representation aims to minimize the number of operations needed to create and access closures as well as the amount of memory used for storing closures. In the general case, a closure must be allocated on the heap, like any regular data structure, because the lifetime of the function it represents is unbounded.

3.2.1 Compiling Higher-Order Functions Efficiently

There are a variety of optimizations that can be applied to make higher-order function calls more efficient. The goal of these transformations is to remove some of the dynamic checks (of arity, etc.) that would otherwise be performed. The best-known of these transformations is inlining, and we illustrate its effects in Figure 3-1. In the figure, assume zipWith is a function that takes a function and two lists and applies the function pairwise to the elements of both lists, producing a new list as a result. The use of f in the body of zipWith constitutes an application of a higher-order function. The function vectSum in part (a) is defined in terms of zipWith and adds the components of two vectors represented as lists.

The inlining transformation of the program in part (b) of the figure consists of copying the body of zipWith into a new function loop27 and of replacing the higher-order application of f with '+', the actual function at the call site. The application then becomes a full application of a known function (actually an operator), with no dynamic checks of the closure to be performed at run-time. Note that the original version of zipWith is not necessarily removed from the program unless the compiler can inline all uses of it.

This inlining transformation is essentially the same as the loop-preheader/invariant-hoisting combination shown by Appel [4, pp. 329-331]. Tarditi [66, p. 216] demonstrates the effectiveness of inlining when whole programs are compiled. In his results, higher-order functions are completely removed from a variety of benchmarks with less than a factor of two increase in code size.

Much of the work on improving the performance of languages with higher-order functions has focused on closures. One of the cornerstones of Steele's Rabbit compiler for Scheme [62], a lexically-scoped dialect of Lisp, was a closure analysis phase. In the analysis, Steele identified fully-closed functions, which required heap-allocated closures, and non-closed functions, which did not require closures because their environment was always accessible at all call-sites. Non-closed functions used registers to store the contents of the environment. Rabbit used the continuation-passing style (CPS) program representation, so in fact, all control flow was expressed as function calls. Hence the emphasis on compiling lambdas well.

    {-# INLINE zipWith #-}
    zipWith f as bs = if (null as || null bs) then []
                      else f (head as) (head bs) : zipWith f (tail as) (tail bs)

    vectSum xs ys = zipWith (+) xs ys

    (a) Original Program

    vectSum xs ys = loop27 xs ys
    loop27 as bs = if (null as || null bs) then []
                   else (head as) + (head bs) : loop27 (tail as) (tail bs)

    (b) Program After Inlining

Figure 3-1: Optimizing Higher-Order Functions via Inlining

The Orbit compiler of Kranz et al. [32] also expended significant effort in choosing the right closure representation for each function. Long-lived functions were represented on the heap while shorter-lived ones were stored on the stack or in registers. The compiler also included other optimizations, like closure hoisting, for reducing the memory consumption of closures: when a group of functions used a common set of variables, these were placed into a single run-time structure and were shared by the closures of all functions in the group. These and many other aggressive implementation techniques, like register allocation via trace-scheduling and native code generation, resulted in fast code that was competitive with imperative languages in some benchmarks.

In more recent work, Shao and Appel [58] make use of compile-time analysis and data-flow information to make their closure representations both efficient and "safe for space complexity" (SSC). The SSC rule states that "any local variable binding must be unreachable after its last use within its scope." It addresses a problem that arises with linked closures, closures that contain pointers to the closure of their enclosing function: dead variables within the enclosing function are kept alive by the links from the enclosed closures. Shao and Appel show how the space consumption of a program can jump easily from O(N) to O(N²) when linked closures are used.

One can view some of the performance penalties of higher-order functions as manifestations of a general problem, namely that predicting the control flow of programs with higher-order functions is difficult. Shivers [61] presents an analysis for Scheme programs that attempts to recover the control-flow graph via abstract interpretation. With the control-flow graph, he shows how to perform standard dataflow analyses, like copy propagation, on programs that include higher-order functions and side-effects. In the context of lazy functional languages, Boquist [13] describes aggressive register optimization based on the analysis of Johnsson [29]. This analysis indicates the set of functions that might be invoked at any point in a lazy program, so it serves as a compile-time approximation of the control flow.

3.2.2 Approach Taken in pHc

The pHc compiler relies primarily on inlining to optimize higher-order function calls. Common functions in the pH Prelude (the set of standard functions available to all programs), such as map and fold, are marked with the {-# INLINE #-} compiler pragma so that their bodies will be copied directly into any call site. There is a penalty to be paid for overly aggressive inlining in that the size of a program increases. Currently, we believe the benefits of inlining outweigh these costs, though a more discerning inlining algorithm would be useful. The deforestation optimization pass [38] also helps reduce the cost of higher-order functions, but only for those that manipulate lists. Deforestation allows code to be written in the desirable higher-order, compositional style, but it removes the overhead of intermediate lists by turning them into loops or recursive functions.

In the area of closure representations, the compiler utilizes lambda-lifting [28] and flat vectors for curried arguments in partial applications. The essence of lambda-lifting is to convert the free variables of a function into regular parameters. Lambda-lifting proceeds in three stages. First, all free variables in a function definition are identified and converted into parameters. Second, the body of the transformed function is lifted to the top level of the program. It becomes a supercombinator, a function with no free variables. And third, the call-sites for the function are extended to include the extra parameters explicitly.

Figure 3-2 illustrates the transformation on several small pH functions.

The great thing about lambda-lifting is that it makes the compiler much simpler. Environments are passed as arguments, and closures are created only by partial applications.

In the low-level code generation, lambda-lifting amounts to passing closures directly in registers. The price for this simplicity is increased register pressure in the generated code, since the free variables of originally deeply-nested functions have to pass through registers in the code of all enclosing functions. Lambda-lifted programs are also not compatible with optimizations like closure-hoisting.

    f1 a b = let f2 c d = let f3 e f = a + b + c + d + e + f
                          in f3 (4*c) (5*d)
             in f2 (2*a) (3*b)

    (a) Simple Program

    f1 a b = f2 a b (2*a) (3*b)
    f2 a' b' c d = f3 a' b' c d (4*c) (5*d)
    f3 a' b' c' d' e f = a' + b' + c' + d' + e + f

    (b) Lambda-Lifted Simple Program

Figure 3-2: Examples of Lambda-Lifting

3.3 Non-Strictness

In the previous chapter we noted that non-strictness increases the expressiveness and the parallelism of pH programs. Here again, the advantage for the programmer is a headache for the compiler writer. Implementing non-strictness poses two important challenges: differentiating between values and expressions, and determining which expressions can be executed together.

Any non-strict language implementation must distinguish between a value and the expression that computes that value. This distinction is not apparent in the λ-calculus model of a language, where equational reasoning holds and like terms can be substituted freely, but it is paramount in low-level code. Take, for instance, a lazy language like Haskell. As we mentioned earlier, the semantics of Haskell dictate that only expressions that contribute to the final result of a program can be evaluated. No other expressions may be evaluated to produce values.

To effect this behavior, Haskell compilers delay every sub-computation by wrapping a λ-abstraction around it. These λ-abstractions, which are implemented as heap-allocated closures, are sometimes called "thunks". When it is certain that the value of an expression is required, the code forces the corresponding thunk. This amounts to invoking the λ-abstraction. The value of the expression is computed, and the result is memoized in the space previously occupied by the thunk. If the value of the expression is required again later, the memoized version will be used, so that an expression is evaluated at most once in a lazy language.
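In C, a thunk and its force operation might look like the following minimal sketch; the layout and names are illustrative, and real lazy-language implementations use more elaborate object formats:

    #include <stdbool.h>

    /* A delayed computation: either an unevaluated code pointer or a
     * memoized result, distinguished by the evaluated flag. */
    typedef struct Thunk {
        bool evaluated;
        union {
            long value;                      /* memoized result        */
            long (*compute)(struct Thunk *); /* suspended computation  */
        } u;
    } Thunk;

    long force(Thunk *t)
    {
        if (!t->evaluated) {
            t->u.value   = t->u.compute(t);  /* run the suspended code */
            t->evaluated = true;             /* memoize in place       */
        }
        return t->u.value;                   /* at most one evaluation */
    }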

But the non-strict semantics of pH are lenient rather than lazy. This means that sub-computations in pH do not have to be delayed, because any expression that may be evaluated will be evaluated, even if it does not contribute to the final result of a program. Still, the language is non-strict, and this means that the schedule for evaluating expressions cannot be determined statically. Thus, the need to distinguish a computed value from an uncomputed "expression" also arises in pH, but to implement this distinction, we do not have to use the expensive thunks of a lazy language.

We implement non-strictness in two ways in our code-generation strategy. The first is the use of presence state in memory, which allows us to determine whether or not the contents of a memory location are valid. The second is the use of a proxy instead of a value in non-strict situations (Traub [69, p. 63] uses the term promise). The basic difference between a proxy and a value is that the proxy identifies a value but does not depend on the value it identifies. In other words, a proxy can always be generated for a value if the value is not strictly required. If a task does require a value but is given a proxy instead, it uses the proxy and the presence state of memory to determine whether the actual value has been computed. If so, the value replaces the proxy and the task proceeds. If not, the task must wait until the value is computed. A good implementation of proxies will strive to make the proxy-or-value test very fast, since it may occur quite often. In a parallel implementation, proxies also imply synchronization between concurrent tasks, and synchronization often mandates the use of expensive atomic operations. To maximize performance in these two areas, the implementation of proxies has to be done with great care.

Certainly, the best code would result if proxies could be avoided altogether. This would require the compiler to schedule all tasks in a pH program statically. We illustrate why this is not possible in general with the function mkPair, based on examples from [57]. The function takes two arguments and returns a data structure, a simple pair of values.

    mkPair :: Int -> Int -> (Int,Int)
    mkPair p q = let a = p * 2
                     b = q + 3
                 in (a,b)

Consider several different invocations of mkPair. The variables on the left-hand side of the equal sign are bound to the components of the tuple returned by the function:

    (a1,b1) = mkPair 8 9   →  (16,12)
    (a2,b2) = mkPair 8 a2  →  (16,19)
    (a3,b3) = mkPair b3 9  →  (24,12)

The first invocation is straightforward. The computations of a and b in the body of mkPair appear independent, and although they could be executed in parallel, it would be much more efficient to execute them sequentially.

The second invocation of mkPair is trickier. The result a2 appears on both the left- and right-hand sides of the binding. In a strict language, this would be an error, but in non-strict pH, this is perfectly correct code. The allocation of the tuple is decoupled from the computation of its contents, so mkPair can return an "empty" tuple immediately after allocating it. Of course, the tuple is not really empty: it contains proxies for its two components. The computation of a can also proceed since the first argument to the function, p, is available. Once a is ready, it is stored into the first element of the tuple. That value then flows back in as the second argument of the function and enables the computation of b, the second component of the tuple. In this invocation of mkPair, the multiplication must execute before the addition in the body of the function.

The third invocation of mkPair presents the reverse situation. Due to the way the arguments are passed to the function, the addition must execute before the multiplication.

In general, the order in which computations are executed in a pH function depends not only on the body of the function but also on the way it is invoked. In the case of mkPair, a and b must be computed by independent tasks, and the order in which these tasks are scheduled can be determined only at run-time. We call this kind of dependence between values a context dependence. The scheduling of the two computations depends on the context in which the procedure call occurs.

Context dependences occur in conditionals or can flow from callees to callers, as in the following example also from [57].

    f1 x y z = (y,z,x)
    f2 x y z = (z,x,y)

    kt2 func z = let (a,b,c) = func x y z
                     x = a+a
                     y = b*b
                 in c

    g z = kt2 f1 z
    h z = kt2 f2 z

In an invocation of g, which leads to an invocation of f1, the multiplication in kt2 will occur before the addition. In an invocation of h, the reverse is true. Due to the context dependences introduced by f2, the addition will occur before the multiplication. Thus, the order in which tasks are executed in kt2 can be determined only at run-time and depends on the function passed as a parameter to kt2.

On dataflow machines like Monsoon, run-time scheduling for pH is supported directly in hardware through mechanisms for cheap synchronization and context switching. The costs are small enough that a dataflow compiler for pH can exploit parallelism down to the level of individual primitive operations. Standard von Neumann processors do not include these facilities, so the overhead of creating and synchronizing fine-grain parallel activities is large compared to the computation they perform. The best performance is obtained on standard processors with long runs of sequential code that compute with values in registers. The need for run-time scheduling of small tasks, which is a direct consequence of non-strictness in the language, becomes a significant obstacle to generating good code.

3.3.1 Optimizing Non-Strictness

The best way of optimizing non-strictness is to get rid of it. In lazy languages, strictness analysis is used to reduce the number of thunks that are created. In lenient pH, partitioning is used to form groups of expressions that can be computed in a single code sequence.

Strictness analysis is used to determine which arguments of a function are always evaluated. In lazy languages, this information is particularly useful because it means that a caller does not have to use thunks for the strict arguments of a callee. The arguments can be evaluated ahead of time, before the callee is invoked, and can be passed directly as values.

Peyton Jones [48, Ch. 22] describes an approach to strictness analysis through abstract interpretation. The results of the analysis are a set of annotations placed on callers and callees describing which arguments can be evaluated in advance. Peyton Jones and Partain [47] measure the effects of such a strictness analyzer within the Glasgow Haskell Compiler, and state that the savings per argument can amount to 25 instructions per call, most of them memory references. On a mixed set of realistic benchmarks, they found the overall savings in program run-time due to strictness analysis to be up to 30%, though savings of 10% to 20% were more typical.

To improve the performance of lenient languages like pH, researchers have developed partitioning analysis. The goal of partitioning is to identify fine-grain threads, for instance individual bindings, that can be combined into a coarser-grained thread without violating the semantics of the language. This transformation constitutes a reduction in the parallelism of a program, so it might seem counterproductive. However, most pH programs exhibit theoretical parallelism beyond what could be exploited profitably by an actual machine, so lower parallelism does not imply lower performance. In fact, performance is improved because fewer and longer threads mean less synchronization, fewer context switches, and better use of processor resources like registers.

Traub [69] was the first to cast the problem of compiling lenient languages for sequential machines as one of partitioning into threads. His analysis yields a function dependence graph, which encompasses the dependences among computations in a function in all possible contexts. Traub also showed how to generate suspensive threads based on the results of his analysis.

Subsequent research and implementations of partitioning concentrated on generating non-suspensive threads. Non-suspensive threads result in simpler code-generation algorithms and make explicit the cost of switching from one thread to the next. In suspensive threads like Traub's, context switching is implicit in the behavior of synchronizing primitives. In the framework of the Threaded Abstract Machine, Schauser et al. [56] studied dependence and demand set partitioning with merging rules for additional grouping. This work was extended by Traub et al. [70] to include interprocedural partitioning, though not for recursive functions. This line of research culminated in Schauser's dissertation [55], which considers all cases.

Concurrently with Schauser, Coorg [15] also studied the problem of partitioning lenient languages. Within procedures, he used demand and tolerance sets to get slightly better partitions than earlier approaches. Across procedures, he used path semantics to determine dependences. Though powerful, this approach incurs significant implementation complexity.

It is interesting to note that partitioning has also been used, with much better results, to coarsen parallelism in strict languages. Working with Sisal, Sarkar [54] developed a partitioning algorithm based on a cost function for execution time that included terms for both parallelism and overhead. These two measures, as we hinted above, are generally at odds with each other, since a program resulting from a highly-parallel partition will incur great scheduling overhead, while one with low overhead will likely be completely sequential. His algorithm operated on Sisal programs expressed in the IF1 hierarchical, graphical intermediate form. At its core, the partitioner iterated over the body of a procedure, merging different sub-graphs at each step until only one was left. On each iteration, the cost of the partition, as given by the cost function, was recorded. When the partitioner terminated, the partition with the lowest recorded cost, thus the "fastest", was reconstructed, and executable code was generated from it.

Sarkar's work was based on a macro-dataflow model of parallelism. For multithreaded code, Tang et al. [65] present a partitioning algorithm (again, for strict programs) that includes a more sophisticated cost function, namely one that considers the ability of a partition to tolerate memory latency. Their results show that their partitioner can come very close to generating "ideal" partitions, and that it is particularly adept at handling fine-grain code. It still remains to be integrated into an actual compiler.

Partitioners for pH use no such sophisticated cost functions. Rather, they work on the premise that "bigger threads (coarser partitions) are always better." This is understandable because non-strictness in pH, in a sense, straitjackets the partitioner by drastically limiting the ways in which expressions can be merged into threads. In the context of Sarkar's work, it is as if non-strictness forced the partitioner to exit early, before working its way down to a single graph (thread) for a procedure, because additional iterations would produce incorrect code. The end results for non-strict languages are finer-grain and less efficient partitions.

3.3.2 Approach in pHc

The mechanism in pHc that implements proxies to distinguish computed from uncomputed values is described in depth in Chapter 4 and later. For the purposes of this overview, it can be summarized as follows. Every value accessed by more than one computation and requiring synchronization is allocated a location on the heap. This location is initialized with a known "empty" pattern that allows the code to determine whether or not the contents have been computed. When we must pass a proxy for a value, we pass a specially-tagged pointer (a reference) to the heap location that will contain the value. Code that is passed a reference checks the location's value against the empty pattern. If there is a match, the code suspends. If not, the code continues. References to values may be copied freely.
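As a rough C rendering of this protocol (the empty pattern, names, and busy-wait are stand-ins; the real run-time system saves the task's live registers and requeues it instead of spinning):

    typedef unsigned long long Cell;

    #define EMPTY_PATTERN 0xFFFFFFFFFFFFFFFFull   /* assumed encoding */

    /* Placeholder for the real scheduler interaction. */
    static void suspend_until_full(volatile Cell *loc)
    {
        while (*loc == EMPTY_PATTERN)
            ;                                     /* busy-wait stand-in */
    }

    Cell follow_reference(volatile Cell *loc)
    {
        if (*loc == EMPTY_PATTERN)                /* proxy-or-value test */
            suspend_until_full(loc);
        return *loc;                              /* the computed value  */
    }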

Clearly, if we reduce the number of values requiring proxies, we will improve the performance of the code. To accomplish this, we need to reduce the number of values that are accessed by different threads, so we need to partition the program well. We implement the demand-tolerance set partitioning algorithm of Coorg in pHc, but we omit interprocedural analysis. Schauser notes that interprocedural partitioning only produced "modest" speedups [55, p. 152] beyond good intraprocedural partitioning. In his benchmarks, the improvement was a maximum of 30%. We return to the discussion of partitioning and thread generation in Chapter 6.

3.4 Synchronizing Memory

Most pH programs create many heap-allocated data structures. This is a result of a mostly-functional programming style that tends to create new values instead of updating old ones. The heap in pH is built on top of an I- and M-Structure substrate, so a fast implementation of synchronizing memory is a first-order concern for the pH compiler writer.

The challenges of implementing synchronizing memory on a standard machine are not subtle. First, an implementation must augment regular memory locations with presence information. The implementor can choose to store presence information with the values or in a separate area of memory, but there is little flexibility as to granularity: every word of the heap must have independent presence information. Second, the presence information must be accessible quickly, and the presence tests (whether a location is empty or full) must be fast. Finally, presence information must be updatable in coordination with the associated value. On a multiprocessor implementation, such updates must be atomic to ensure the integrity of the synchronizing memory protocol.

3.4.1 Fast Implementations of Synchronizing Memory

Fast implementations of synchronizing memory have traditionally relied on hardware support. In the Monsoon processor, such support was extensive and took the form of a specialized memory controller designed by Ken Steele [63]. The controller was essentially a simplified version of a Monsoon processor element, but it was still fully pipelined and included a microcoded protocol engine.

The Sparcle processor on the Alewife machine also includes support for synchronizing memory [33]. The designers of Alewife, however, implemented only a small (though highly integrated) hardware component to handle the common cases, and took care of the rest in software. The hardware consisted of a modified SPARC processor with 12 additional instructions that trapped if the low bit was set in a memory address. Even this limited support was found to contribute significantly to the performance of applications that rely on fine-grain synchronization. The Tera MTA [2] follows a similar strategy, though it was not inspired by Alewife. Memory in the Tera MTA is augmented with four access-state bits which can cause lightweight traps in the processor. Trap handlers can be used to implement synchronizing memory.

3.4.2 Approach in pHc

Synchronizing memory in pHc is implemented completely in software, since standard SMPs include no hardware support for I- and M-Structures beyond basic atomic memory operations. We developed a novel 64-bit data representation, described in Section 7.3, that allows us to store presence information along with data with a minimal reduction in the range of data represented. The representation also provides for fast presence-status checks and allows simultaneous update of presence and data.
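As a purely illustrative sketch of the idea (the actual pHc encoding is the one given in Section 7.3, not this one), presence can be folded into the 64-bit word itself so that a single store updates data and presence together:

    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed reserved bit pattern marking a cell as empty; it removes
     * one value from the representable range, nothing more. */
    #define EMPTY_TAG 0xFFF4000000000000ull

    static inline bool is_empty(uint64_t cell)
    {
        return cell == EMPTY_TAG;     /* fast presence-status check */
    }

    static inline void store_value(uint64_t *cell, uint64_t v)
    {
        *cell = v;  /* one word-wide write sets data and presence at once
                       (an atomic store on a real SMP) */
    }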

While this software-only approach is slower than one with hardware support, it is still amenable to optimizations. For example, a compiler analysis that considers the use of data structures in conjunction with partitioning could determine that some data structures do not need to be synchronized at all because all of their elements are produced before they are consumed. Such data structures could be accessed with standard load and store instructions, with much less overhead than the full I- or M-Structure protocols.

3.5 Barriers

Barriers in pH require substantial bookkeeping for two reasons. First, the granularity of the tasks controlled by a barrier can be very fine, as small as the simplest expression in pH. Second, the extent of a barrier is dynamic. The barrier guarantees that all computation in the pre-region, including computation in functions invoked by the pre-region, has terminated before the post-region executes. As we noted in the previous chapter, sync statements in Cilk provide the same kind of guarantee. In Cilk, however, only procedure calls can be spawned as parallel tasks, and these are much coarser-grained computations than pH expressions.

3.5.1 Barrier Implementations

Logically, the implementation of a barrier requires a counter to keep track of outstanding tasks. When a task is created, the counter is incremented, and when a task terminates, the counter is decremented. When the counter reaches zero, the barrier discharges. In practice, this single logical counter is often implemented using a number of distributed counters that are linked together. This is the strategy used to implement barriers in Cilk. In Cilk, a sync statement in the body of a procedure forces execution to wait (using a counter) for all previously spawned procedures to return. Recursively, each spawned procedure returns only when all of its spawned callees have finished. To maintain this invariant, Cilk inserts an implicit sync at the end of every procedure. Thus, the logical barrier counter at the top of a Cilk call-tree is implemented by local counters within each procedure.

A similar scheme has been used in previous compilers for Id, but with an additional complication. In a strict language like Cilk, the return of a value from a procedure is enough to indicate termination of the callee, but this is not possible in a non-strict language. A non-strict procedure must be able to return a value as soon as possible, even before all computation terminates. Consequently, in Id compilers, every procedure returns two results: a value and a signal. The key point to remember is that, due to non-strictness, each result is returned independently and possibly at very different times during execution.

The signal is used to keep track of tasks which do not contribute to the value of a procedure. The value depends on the completion of some portion of all the tasks created in a procedure, so its computation implies their completion. But many tasks, such as pure side-effects, do not contribute to the return value. In the executable code, the signal is made to depend on the completion of these other tasks. Within each procedure, the signal is implemented as a local synchronization counter, and each procedure includes within its own signal tree the returned signals of its callees. Only the return of both a value and a signal marks the termination of a callee, and this is the condition used to implement barriers.

3.5.2 Barriers in pHc

In pHc, we use a single counter for every barrier rather than a distributed counter. A barrier is represented as a heap-allocated data structure, which includes the counter and some additional state, and a pointer to it is passed from caller to callee in a dedicated register. This approach simplifies procedure calls. We trade the complexity of returning two independent results from a procedure, the value and the signal, for an additional argument to every procedure, the current barrier. In addition, the code within a procedure is cleaner since we don't have to keep track of both values and signals.
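A minimal C sketch of such a barrier, with illustrative names and C11 atomics standing in for the run-time system's actual primitives:

    #include <stdatomic.h>

    /* Heap-allocated barrier: the single counter plus some extra state.
     * A pointer to the current barrier travels with every call. */
    typedef struct Barrier {
        atomic_long count;        /* outstanding tasks in the pre-region */
        void       *post_region;  /* work to release when count hits 0   */
    } Barrier;

    static void barrier_inc(Barrier *b)      /* a task is created    */
    {
        atomic_fetch_add(&b->count, 1);
    }

    static void barrier_dec(Barrier *b)      /* a task has terminated */
    {
        if (atomic_fetch_sub(&b->count, 1) == 1) {
            /* counter reached zero: discharge the barrier and schedule
               the post-region (scheduler interaction omitted) */
        }
    }

The atomic increment and decrement are where the contention mentioned below can arise when the pre-region is highly parallel.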

On the other hand, this approach may lead to contention for the barrier counter if there is a lot of parallelism in the barrier pre-region. Currently, this appears to be a negligible source of overhead in our code. In addition, as we'll see later, with this scheme we lose the ability to determine exactly when a callee terminates. This does not affect the ability of the compiler to generate correct code, but it does complicate the deallocation of procedure activation frames.

3.6 Garbage Collection

Garbage collection is a well-studied problem in sequential, distributed, and even parallel environments [30], and the issues in the pH compiler are not particularly different from those in other systems. Still, garbage collection is hard to implement, so it merits a brief mention in this chapter.

The root of the problem is that precise (not conservative) garbage collection does not lend itself to a modular implementation. In pH, for instance, the data representation is self-identifying so that the garbage collector can trace memory and distinguish pointers from scalars; the code-generator has to keep track of a correct root set at all times when a garbage collection might occur, and it also has to maintain invariants like no pointers into the middle of objects (these complicate tracing of the heap); even the scheduler is affected, as it must check for a pending garbage collection after running out of work but before stealing work from other processors. Such a pervasive influence tends to weave intricate dependences between the components of a system, and it complicates the overall design and implementation effort.

3.7 Summary

pH was designed to make parallel programming easy, efficient, and robust. The consequence of these design choices is that the pH compiler writer must overcome a broad set of challenges to generate good code. In languages like Cilk, many of these problems do not even exist, since such languages are often constrained in order to simplify compilation. In this chapter, we concentrated on these issues:

• Polymorphism in the type system favors uniformity in calling conventions and data structure representations.

• Higher-order functions force dynamic checks before applications and rely on efficient environments for free variables.

• Non-strictness requires a mechanism to distinguish expressions from computed values at run-time. In addition, non-strictness limits the ability of the compiler to coalesce expressions into sequential threads.

• Synchronizing memory requires fine-grain presence information in a form that is easy to access, provides for fast status checks, and can be updated atomically.

• Barriers demand careful bookkeeping of the termination of all spawned tasks, not just procedures.

• Garbage collection affects the entire system, from data representation, to code generation, to scheduling.

Chapter 4

The SMT Machine

The process of converting a pH source program into von Neumann machine code is an arduous one. As we have seen in the previous chapters, the language includes a number of powerful features to make parallel programming easy, but the efficient implementation of these features is not straightforward. Furthermore, there is a "semantic chasm" between the source and the target languages. Even the λ*-calculus remains a high-level language whose semantics are far removed from von Neumann code.

To simplify the problem, the pHc code generator makes use of two key abstractions. The first is the λ*-calculus described earlier. The second is the Shared-Memory Threaded (SMT) machine, the subject of this chapter. The SMT machine is an idealization of a modern symmetric multiprocessor. It reflects the important properties of this whole class of machines while insulating the code generator from the quirks of any particular instance. It provides the bridge between λ* and executable code. Thus, a source pH program is translated first into λ*, then into SMT instructions, and finally into executable code.

Abstract machines have long been used to describe the evaluation model of functional languages and as compilation targets. In the case of Haskell, the Spineless Tagless G-machine [49] is at the core of the Glasgow Haskell Compiler. For Id, P-RISC [18] and TAM [16] have served as abstract code-generator targets. The SMT machine itself is based on the work of Aditya et al. [1], who used an early version to give a semantics for pH. It should be clear, then, that the SMT machine was not designed in a vacuum. It includes mechanisms, such as support for non-strictness and parallelism, specifically intended to simplify the compiling strategy given in the next chapter. Still, these mechanisms were designed to be as lean as possible, and their translation to executable code is not difficult.

Figure 4-1: The Shared-Memory Threaded (SMT) Machine (several processors sharing a global memory that includes the work queue)

4.1 Overview

The SMT machine is illustrated in Figure 4-1. It consists of a number of von Neumann processors, each with independent register state, but all sharing a global memory. The processors also share a global work queue. The work queue contains data structures describing tasks that are ready to run. The SMT machine has an instruction set that is more flexible than that of the typical von Neumann RISC machine. For instance, it offers indirect addressing, instructions that work with both general-purpose and floating-point registers, and instructions with a variable number of operands. More importantly, it offers instructions for implementing synchronizing memory, barriers, and task scheduling.

Each processor in the machine contains a register file for general-purpose (GP) registers and a separate one for floating-point (FP) registers. GP registers can hold integers, memory addresses, or floating-point values, but floating-point operations only operate on FP registers. By allowing GP registers to contain FP values, we simplify the data representation and the implementation of polymorphic function calls. Each processor also contains three special registers: the program counter register rpc, the current barrier register rB, and the current task register rT.

SMT global memory is word-addressed and includes per-word presence information. A word is the widest value that can fit in a register. Instructions are also the size of a word, so that incrementing rpc by 1 causes it to point to the next instruction. Global memory is coherent, but its consistency model is weak: programs cannot assume that the order in which memory operations are issued on one processor will match the order in which they are observed by other processors.

In the design of the machine, we made a conscious decision to avoid specifying particular bit-sizes for basic data types like integers and floats. Our single assumption is that a GP register can contain any scalar quantity. An SMT implementation is free to define its own sizes for data types. To make particularly efficient use of memory, one might also define additional memory load and store instructions that operate at a finer granularity than a "word". However, this is not a trivial decision since the granularity of presence bits in memory might also have to change. For the purposes of our compiling strategy, word-based addressing is sufficient.

4.1.1 Instruction Descriptions

The entire SMT machine instruction set is summarized in Figure 4-2. In the following sections, the behavior of each instruction is described using a simple register transfer level (RTL) notation (Figure 4-3). A description has the following format:

    Instruction
        Description of instruction in RTL notation

It is important to note that SMT instructions execute atomically, even when they access memory. The RTL description of an instruction constitutes a single, indivisible action.

A note about field selection syntax: the '.' is used to hide the details of accesses to multi-word data structures. In final code generation, field selection is replaced with a numeric offset.

    INSTRUCTION   PURPOSE
    spawn         Add task to the work queue
    schedule      Schedule task from the work queue
    jump          Conditional control transfer
    switch        Computed jump, unconditional
    allocate      Memory allocation
    load          Read from memory, unsynchronized
    store         Write to memory, unsynchronized
    touch         Presence status check, synchronizing
    take          Read from memory with mutual exclusion, synchronizing
    istore        Write to memory, synchronizing
    initbar       Barrier initialization
    touchbar      Barrier status check, synchronizing
    incbar        Barrier counter increment
    decbar        Barrier counter decrement
    plus, etc.    Primitive operations
    move          Register-to-register move

Figure 4-2: SMT Instruction Set Summary

    SYNTAX       MEANING
    <-           Assignment
    .            Structure field selection
    .p           Presence field of memory cell: full or empty
    .d           Data field of memory cell
    ra           Some GP register
    r1           Specific GP register
    rfa          Some FP register
    rfl          Specific FP register
    rpc          Program counter register
    (ra)         Contents of register ra
    a            Memory address
    Mem[a]       Contents of memory
    WQ           The global work queue
    cond         Boolean conditional expression
    a, ..., z    Utility variables

Figure 4-3: RTL Notation for SMT Instruction Descriptions

The SMT instruction set supports two addressing modes for use in memory instructions. We describe these with RTL notation:

    ADDRESSING MODE             SYNTAX    ADDRESS IN RTL
    Register Indexed            ra[i]     (ra) + i
    Register Indexed Indirect   @ra[i]    Mem[(ra) + i]

If the index is omitted in an addressing expression, it is assumed to be zero.

4.1.2 Register State Across Suspensive Instructions

The suspensive instructions in the SMT instruction set are touch, take, and touchbar. When one of these instructions executes, there is the possibility that the task might have to suspend. If so, the machine implicitly saves all register state and restores it when the task resumes. In practice, only a limited number of registers actually contain live values, so saving the entire register set is wasteful.

To remedy this problem, suspensive instructions include a live-state annotation such as "{r1, r3}". The annotation tells exactly which registers have to be saved and restored when the task suspends and resumes. The values of the registers are saved in the current task structure pointed to by register rT, and it is up to the code-generator to guarantee that enough memory is available in the task structure to store the registers. The identity of the saved registers is encoded in a register mask, also stored in the task structure.

4.1.3 Compilation Preview

Before moving on to the description of the instruction set, we give a preview of some of the major features of the compilation strategy in the next chapter. It is useful to keep these in mind when trying to understand the behavior of each instruction.

" On invocation, each procedure is allocated an activation frame on the global heap. The

activation frame is used for storing variables local to the procedure as well as barrier and

task structures. A frame is allocated in the global heap rather than in local processor

storage because different tasks from the same procedure might execute in parallel on

different processors.

" The source language is non-strict, so the SMT code often needs to refer to values without

actually computing them. As we explain in detail in Section 5.1.1, we use pointers to

pass references to values. The use of references is important in preserving sharing since

65 the transition from uncomputed expression to computed value is a change that must be seen by all consumers of a value.

" The basic entity that can be scheduled on any processor is a task (thread). A task

consists of a program counter address and a set of live registers. The state of each task

is stored separately in a task structure in the activation frame of the enclosing procedure.

The maximum size of the state for a task is the entire register file.

" In addition to the special registers defined the by the machine model, the compiler also

assigns special roles to some registers. For instance, a particular register rF always

carries a pointer to the activation frame of the current procedure.

4.2 Arithmetic and Move: plus, etc., and move

The simplest instructions in the SMT instruction set are those that implement arithmetic primitives. We use plus as an example:

    plus(ra, rb, rc)
        ra  <- (rb) + (rc)
        rpc <- (rpc) + 1

The leftmost register is the destination for the results of the plus instruction. The RTL notation states that the contents of the rightmost registers are added and deposited in the destination register. The same instruction is also used to represent floating-point addition with the use of FP registers as operands. Constants are also valid operands.

The move instruction is also easy to explain: it moves a value into a register. The value can be the contents of another register or a constant.

move(ra, rb)

ra  <- (rb)
rpc <- (rpc) + 1

4.3 Control Flow: jump and switch

SMT offers two control flow instructions: jump and switch. The jump instruction is used for both conditional and unconditional jumps. The first operand of the instruction is a boolean test, the second is the target jump address, and the third is an optional register to store a "return link", the address of the instruction following the jump. The return link is used to implement procedure calls.

jump(cond, a, rc)

if cond then rc  <- (rpc) + 1
             rpc <- a
        else rpc <- (rpc) + 1

The conditions available for the cond operand include the usual comparison operators (==, <, >, and so on) as well as additional ones to test whether the contents of a memory cell are full or empty (isFull, isEmpty). The condition Always is always true and is used to express an unconditional jump.

The switch instruction is used to implement multi-way jumps. The first operand of the instruction is a constant or register whose value is non-negative and determines the alternative address to jump to.

switch(ra, a0, ..., an)

if ( 0 <= (ra) <= n ) then rpc <- a_(ra) else Error!

If the value of the first operand does not identify one of the alternatives, the machine halts with an error. The switch instruction is particularly useful because it simplifies the compilation of the Case expression in λ*.

4.4 Scheduling: spawn and schedule

The next two instructions are used to access the global work queue of the SMT machine. The spawn instruction inserts a task structure into the work queue, WQ. We treat the work queue as a list, so the operation of adding an element is represented as ":" (cons).

spawn(atask)

WQ  <- atask : WQ
rpc <- (rpc) + 1

Below, we illustrate a task structure. The core components are the next field and the pc field.

Task Data Structure

next         full   Pointer to next task structure in list
pc           full   Program counter for task
regmask      full   Register mask: encodes identity of saved registers
taskstate1   full   Saved register
...
taskstaten   full   Saved register

The next field is used when the task structure forms part of a list (as in the work queue). The pc field is used to store the address of the instruction where execution is to proceed in the task. Initially, this is the address of the first instruction in the task. Later, if the task suspends, it contains the address of the instruction where execution is to resume. The remaining fields are used to store the values of registers that are live when the task suspends. The schedule instruction extracts a task data structure from the work queue. It initializes the program counter with the contents of the pc field and the rest of the registers based on the register mask and the contents previously saved in the task structure. To simplify the description of the instruction, we ignore the register mask and assume all registers are restored. If the queue is empty, then the instruction effectively waits until some new task is placed onto the work queue.

schedule

loop: if (nil WQ) then goto loop
      else rT  <- head WQ
           WQ  <- tail WQ
           rpc <- Mem[rT.pc]_δ
           r1  <- Mem[rT.taskstate1]_δ
           r2  <- Mem[rT.taskstate2]_δ
           ...
           rn  <- Mem[rT.taskstaten]_δ

It is important to note the following points. First, access to the work queue in both of these instructions is atomic. Second, spawn does not affect control flow. After the instruction is finished, control passes to the following instruction. Finally, the symbolic names, like pc and taskstatei, corresponding to task structure fields free us from having to use numeric field offsets.
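To make the queue discipline concrete, here is a minimal C sketch of spawn and schedule, treating WQ as a LIFO list linked through the next field. The real implementation must make both operations atomic (the RTS uses a work-stealing scheduler), so this single-threaded sketch is illustrative only:

typedef long Word;

typedef struct Task {
    struct Task *next;   /* next task in the work queue or a deferred list */
    Word         pc;     /* address at which the task starts or resumes */
} Task;

static Task *work_queue = 0;   /* WQ; a real SMP needs a lock or CAS here */

/* spawn: push a task structure; control then falls through. */
void spawn(Task *t) {
    t->next = work_queue;
    work_queue = t;
}

/* schedule: pop a task, waiting while the queue is empty.  The caller
   restores registers from the task structure and jumps to t->pc. */
Task *schedule(void) {
    while (work_queue == 0)
        ;
    Task *t = work_queue;
    work_queue = t->next;
    return t;
}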

4.5 Memory: allocate, load, touch, store, istore, and take

The allocate instruction reserves fresh storage from the shared global memory. The register rb contains the number of adjacent cells to allocate, and after the instruction executes, the ra register contains the address of the first of these. The presence fields of the memory cells are initialized to empty.

allocate(ra, rb)

Mem[a]_π            <- empty
Mem[a + 1]_π        <- empty
...
Mem[a + (rb) - 1]_π <- empty
ra  <- a
rpc <- (rpc) + 1

The load instruction reads a value from memory into a register. The instruction pays no attention to the presence field of the memory cell. The structure of the SMT machine code must guarantee that a valid value is available.

load(ra, a)

ra  <- Mem[a]_δ
rpc <- (rpc) + 1

The touch instruction is used to check the presence status of a memory cell. A touch instruction issued against a full memory cell does nothing; control passes immediately to the next instruction. However, if the memory location is empty, the touch instruction forces the current task to suspend, saves all registers in the live-state annotation (only two in this example) in the current task structure, places the current task structure on the deferred list associated with the memory cell, and schedules a new task from the global work queue, just like the schedule instruction. Remember that the deferred list is used to record all the tasks that are waiting for the data field of the memory cell to be computed. A pointer to the head of the deferred list is stored in the data field of the memory cell while the presence field is empty.

touch(a) {ra, rb}

if (Mem[a]_π == empty)
then Mem[rT.pc]_δ         <- (rpc) + 1
     Mem[rT.taskstate1]_δ <- (ra)
     Mem[rT.taskstate2]_δ <- (rb)
     Mem[rT.regmask]_δ    <- ra ⊕ rb
     Mem[a]_δ             <- (rT) : Mem[a]_δ
     do schedule
else rpc <- (rpc) + 1

Note that a task that suspends due to a touch instruction always resumes at the instruction following the touch since that is the value placed in the pc field of the task structure.

Also, we note again that the touch instruction, and all other SMT instructions, execute atomically. This means that no other instruction can modify the memory locations the touch is accessing while it is executing.

The address a in the description of touch is called a reference (or location pointer). References play a central role in the implementation of non-strictness because, as we'll see in the next chapter, they can be used to identify an expression regardless of whether or not the expression has been computed.
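A C rendering of the touch behavior may help. The sketch below assumes, as our own simplification, that a cell is a presence flag plus one word whose empty state holds the head of the deferred list; save_live_state and schedule stand in for the mechanisms described above:

typedef long Word;

typedef struct Task Task;
struct Task { Task *next; Word pc; /* saved registers omitted */ };

typedef struct Cell {
    int  full;    /* presence field: full (1) or empty (0) */
    Word data;    /* value when full; deferred-list head when empty */
} Cell;

extern void  save_live_state(Task *t, Word resume_pc);  /* assumed helper */
extern Task *schedule(void);                            /* as sketched earlier */

/* touch: fall through if the cell is full; otherwise suspend the current
   task on the cell's deferred list and pick up other work.  We assume a
   Task* fits in a Word. */
Task *touch(Cell *c, Task *current, Word resume_pc) {
    if (c->full)
        return current;              /* control passes to the next instruction */
    save_live_state(current, resume_pc);
    current->next = (Task *)c->data; /* cons onto the deferred list */
    c->data = (Word)current;
    return schedule();
}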

The store instruction writes the contents of a register into memory and sets the presence field to full. It ignores the old values of the presence and data fields of the memory cell.

store(a, ra)

Mem[a]_δ <- (ra)
Mem[a]_π <- full
rpc <- (rpc) + 1

In contrast, the behavior of istore depends on the value of the presence field. If the cell is empty, then the istore updates the data field of the cell, changes the presence field to full, and places all the tasks on the deferred list (d), which could be an empty list, back onto the work queue. We use the list append operation (++) to accomplish this task. If the memory cell is full, then the istore signals an error, and the machine halts.

istore(a, ra)

if (Mem[a]_π == empty)
then d <- Mem[a]_δ
     Mem[a]_δ <- (ra)
     Mem[a]_π <- full
     WQ <- d ++ WQ
     rpc <- (rpc) + 1
else Error!

It is easy to see that load, touch, and istore provide the basic operations for implementing I-Structures. But M-Structures require an additional instruction, take, since they offer mutable, not single-assignment, semantics. The take instruction is used to extract a value from an M-Structure cell. If the presence field of the cell is full, then the take changes it to empty and extracts the data field. If the cell is empty, then take causes the current task to suspend.

take(ra, a) {ra, rb}

if (Mem[a]_π == empty)
then Mem[rT.pc]_δ         <- (rpc)
     Mem[rT.taskstate1]_δ <- (ra)
     Mem[rT.taskstate2]_δ <- (rb)
     Mem[rT.regmask]_δ    <- ra ⊕ rb
     Mem[a]_δ             <- (rT) : Mem[a]_δ
     do schedule
else ra <- Mem[a]_δ
     Mem[a]_δ <- nil
     Mem[a]_π <- empty
     rpc <- (rpc) + 1

A task that suspends due to a take resumes when an istore is issued against the memory cell. The task resumes at the same take instruction that suspended, so it tries to grab the value of the memory cell again. Many tasks may be trying to do so concurrently, but only one at a time will succeed. This somewhat cumbersome protocol guarantees mutual exclusion and prevents any single task from dominating access to the M-Structure cell.
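The istore/take pair can be sketched in C under the same simplified cell representation used earlier; enqueue_all and suspend_on are assumed helpers standing in for the RTL behavior, and error handling is elided:

typedef long Word;
typedef struct Task Task;
struct Task { Task *next; Word pc; };
typedef struct Cell { int full; Word data; } Cell;

extern void enqueue_all(Task *deferred);  /* put a whole list on the work queue */
extern void suspend_on(Cell *c);          /* defer the current task, as in touch */

/* istore: single-assignment write; wakes every task deferred on the cell. */
void istore(Cell *c, Word v) {
    if (c->full) { /* Error!  double write to a full cell */ return; }
    Task *deferred = (Task *)c->data;
    c->data = v;
    c->full = 1;
    enqueue_all(deferred);    /* waiting tasks retry their touch/take */
}

/* take: extract the value and re-empty the cell; returns 0 when the task
   must suspend and retry the take after a matching istore. */
int take(Cell *c, Word *out) {
    if (!c->full) { suspend_on(c); return 0; }
    *out = c->data;
    c->data = 0;       /* nil */
    c->full = 0;
    return 1;
}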

4.6 Barrier Instructions: initbar, touchbar, incbar, and decbar

The sequential statement conjunction (-) of λ* is implemented in SMT using a barrier data structure. The barrier, illustrated below, consists of three fields: a counter field (count) containing an integer; a previous barrier field (prev) pointing to the old barrier in effect when this one was created; and a post-region field (post) possibly containing a pointer to a task. Barrier structures are used to keep track of the termination of tasks created by spawn instructions. The role of these instructions in the λ* compilation strategy is explained in the next chapter, so here we focus only on their individual behavior.

Barrier Data Structure

count full Count of outstanding activities in pre-region

prev full Pointer to previous barrier

post full Pointer to task structure for barrier post-region

Barrier structures and the barrier instructions described below should be regarded as the definition of an abstract data type. It is entirely possible to define barriers using standard synchronizing memory instructions, but we found that the use of separate instructions simplified the presentation.

The initbar instruction is used to initialize barrier structures. We assume the space for the barrier has been created using the allocate instruction and its address is abarrier. The count field is initialized with register ra, the prev field is initialized with the current barrier structure from register rB, and the post field is left untouched. It is important to save the value of the rB register. The fine-grain barriers of pH can be nested, and this means that different threads, possibly scheduled one after the other, might belong to different barrier regions. In the case of the initbar instruction, the current value of rB must be restored once the post-region is ready to execute. Once initialized, the new barrier is made the current barrier by storing its address in rB.

initbar(abarrier, ra)

Mem[abarrier.count]_δ <- (ra)
Mem[abarrier.prev]_δ  <- (rB)
rB <- abarrier

The key component of the barrier structure is the counter. The idea is that if the barrier counter field is zero, then the pre-region has terminated (the barrier is said to discharge), and the post-region can execute.

The touchbar is designed to be the instruction that separates the pre-region from the post-region. It checks the condition that the barrier counter is zero. If the barrier has discharged, then the instruction updates the rB register and continues. If the barrier has not discharged, the instruction suspends the current task before control flows into the post-region. The instruction is designed so that only one touchbar should execute against a particular barrier structure.

touchbar(abarrier) {ra, rb}

if (Mem[abarrier.count]_δ == 0)
then rB  <- Mem[abarrier.prev]_δ
     rpc <- (rpc) + 1
else Mem[rT.pc]_δ         <- (rpc) + 1
     Mem[rT.taskstate1]_δ <- (ra)
     Mem[rT.taskstate2]_δ <- (rb)
     Mem[rT.regmask]_δ    <- ra ⊕ rb
     Mem[abarrier.post]_δ <- (rT)
     Mem[abarrier.post]_π <- full
     do schedule

The incbar instruction increments the value of the count in a barrier. In the compilation strategy shown in the next chapter, an incbar instruction is issued every time a new task is spawned.

incbar(abarrier)

Mem[abarrier.count]_δ <- Mem[abarrier.count]_δ + 1
rpc <- (rpc) + 1

The decbar instruction is used to decrement the count of a barrier structure. It complements the use of incbar to mark the termination of a task. If the value of the counter is one, then the instruction checks the post field of the barrier. If the post field is empty, then the counter is decremented. This configuration of the barrier structure arises when the last decbar instruction for a particular barrier occurs before the single touchbar associated with the barrier.

On the other hand, if the field is full, it contains a suspended task placed there by the touchbar instruction. This configuration of the barrier structure arises when the touchbar instruction executes before the last decbar. This indicates that at the time control-flow reached the end of the pre-region (immediately before the post-region), some tasks spawned in the pre-region were still executing. According to the sequential semantics of barriers, the post-region must wait until those tasks terminate. When those tasks terminate, they execute decbar instructions, so the last decbar must restart the post-region.

decbar(abarrier)

c <- Mem[abarrier.count]_δ
if ( c > 1 )
then Mem[abarrier.count]_δ <- c - 1
else Mem[abarrier.count]_δ <- 0
     if ( Mem[abarrier.post]_π == full )
     then WQ <- Mem[abarrier.post]_δ : WQ
rpc <- (rpc) + 1
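Read operationally, the three counter instructions amount to the following C sketch. The field names follow the barrier diagram; in the real machine each instruction executes atomically, and a schedule follows a suspending touchbar:

typedef struct Task Task;

typedef struct Barrier {
    long            count;      /* outstanding activities in the pre-region */
    struct Barrier *prev;       /* barrier in effect when this one was made */
    Task           *post;       /* suspended post-region task, if any */
    int             post_full;  /* presence bit for the post field */
} Barrier;

extern void  enqueue(Task *t);                 /* push onto the work queue */
extern Task *suspend_current(long resume_pc);  /* save state, return task */

void incbar(Barrier *b) { b->count++; }

void decbar(Barrier *b) {
    if (b->count > 1) { b->count--; return; }
    b->count = 0;
    if (b->post_full)       /* touchbar already ran: restart the post-region */
        enqueue(b->post);
}

/* touchbar: returns 1 when the post-region may proceed immediately;
   otherwise parks the current task in the post field. */
int touchbar(Barrier *b, long resume_pc) {
    if (b->count == 0) return 1;
    b->post = suspend_current(resume_pc);
    b->post_full = 1;
    return 0;
}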

4.7 Summary

In this chapter, we have described the SMT machine and its instruction set. The SMT machine models a basic symmetric multiprocessor, but it includes extra instructions to sup- port multithreaded execution and non-strictness. With instructions like touch and schedule, we capture simple mechanisms for synchronization and scheduling that are not available in real machines.

After the λ*-calculus, the SMT machine is the second major abstraction upon which our compiler is built. Its use has simplified the development of the code generator in two ways. First, the SMT machine allows us to focus on the important problems of code generation, emitting correct and efficient code, without worrying about the quirks of any particular target machine. Second, it simplifies retargeting the back-end to different platforms since, in principle, no modules that use SMT need to be rewritten when the target architecture changes.

Chapter 5

Compiling λ* to SMT

This chapter discusses the compilation of a λ* program into SMT instructions. This involves the transformation of a program where parallelism is implicit and operations are only loosely ordered into one where parallelism is explicit and control flow follows the serial von Neumann model. In addition, the compilation takes a program from a high-level representation into a low-level one.

The simple SMT instruction set forces the compiler writer to take a position on many issues that are not relevant to λ*, like data structure representations and procedure calling conventions. This chapter opens with a discussion of these issues, which together determine the "execution environment" of SMT code. The rest of the chapter describes how each λ* expression is compiled into SMT code. This style of compilation has its roots in the work of Aditya et al. [1] and Arvind et al. [7, 6]. The SMT code generated by the compilation rules presented in this chapter is true to the semantics of λ*, but has not been optimized for efficiency. Later passes in the compiler for partitioning and threading, discussed in the next chapter, are designed to squeeze out some of the overhead introduced here.

5.1 Execution Environment

This section describes how compiler-internal and user data structures are organized. User data structures result from the application of constructors defined in type definitions. The internal data structures include references to represent values that may not have been computed, activation frames to store local procedure values, closures to represent λ-abstractions, tasks, and barriers. Tasks and barriers were discussed earlier along with the SMT instructions to manipulate them. Internal data structures are not visible in the user's program, but rather, are used only by the compiler to implement objects outside the language model.

5.1.1 References to Values

An implementation of λ* on a von Neumann machine (like SMT) requires a mechanism by which programs can refer to values without needing to compute them. Consider, for instance, the problem of creating a closure. A closure contains a vector of free variables that captures those pieces of the lexical environment that might be used at the invocation of the closure. What object should be stored in the closure for a free variable? One option is to compute and store the value of the free variable. However, if that value is not used when the closure is invoked (or if the closure is never invoked), then the value will have been computed prematurely. This approach is incorrect in many contexts because closures are supposed to be non-strict with respect to free variables.

The correct strategy is to store the name of the free variable in the closure. A concrete implementation of "the name of a variable" is a pointer to the memory location that may eventually hold the value of the variable. We say "may" because the variable might never be computed if, for instance, the task that uses it appears on the branch of a conditional that is not taken. This location pointer (reference) uniquely identifies the variable, yet its use does not require the variable to be computed. In the compilation rules in this chapter, references are often used to preserve the non-strict semantics of λ*, especially in the presence of barriers, conditionals, closures, and function calls.

Other non-strict languages, like Haskell, also need to represent values without computing them. Because it is lazy, a Haskell implementation that computes a value prematurely risks violating the semantics of the language. Instead, the implementation passes a proxy for a value until the value is used in a strict construct (e.g., primitive operation or case expression). That proxy is usually a pointer to a closure (thunk). In pH, we don't use thunks because we can dissociate computations from the values they produce. A reference simply names a value that may or may not be computed, but it need not tell us anything about the task that computes it.
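The contrast can be made concrete with a small sketch: under the simplified cell representation used earlier, a reference is nothing more than a pointer to a presence/data pair, with no computation attached to it. Spinning stands in for the suspend-and-resume of the real touch instruction:

typedef long Word;
typedef struct Cell { int full; Word data; } Cell;  /* presence + data */

/* A reference is just a Cell*.  It can be stored in closures and data
   structures before the value exists. */
Word read_through_reference(Cell *ref) {
    while (!ref->full)
        ;
    return ref->data;
}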

5.1.2 User Data Structures

User data structures in λ* are created using constructors. For instance, the Cons and Nil (: and [] in pH syntax) constructors are used to create lists, and Pair (',' in pH syntax) is used to create 2-tuples. In addition, the user can define new constructors like Leaf and Node, as we showed in Section 2.2.

The most important thing to remember about constructors is that they are non-strict with respect to their arguments. This allows the "shell" of a new data structure to be created and passed around a program even while its contents are being computed. To preserve non-strictness, the implementation of constructors relies heavily on references. For example, the result of the expression Node (x, y) is the following data structure:

Node (x,y) Data Structure

tag full 1

left subtree full Reference to x

right subtree full Reference to y

The structure consists of three memory cells: the first is a tag used internally by the compiler to distinguish different constructors for the purposes of Case expressions. Tags are discussed in detail in Section 5.3. The next two cells contain references to the components of the Node data structure. In general, the compiler cannot be certain that the values corresponding to the contents of the Node have been computed when the Node is created. The compiler can be certain, however, that references to the contents will be available, so it uses these instead. In this way, the Node constructor executes non-strictly: it allocates memory for the data structure, fills it with references, and returns it without having to wait for any other computations to terminate.
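In C-like terms, building Node (x, y) might look as follows; Cell and the helper names are our own simplification, with references modeled as cell pointers stored in the data word:

typedef long Word;
typedef struct Cell { int full; Word data; } Cell;

extern Cell *allocate(long ncells);   /* fresh cells, presence = empty */

/* Build Node (x, y): store the tag and two references, never touching
   the values of x and y themselves. */
Cell *make_node(Cell *x_ref, Cell *y_ref) {
    Cell *node = allocate(3);
    node[0].data = 1;            node[0].full = 1;  /* tag: Node = 1 */
    node[1].data = (Word)x_ref;  node[1].full = 1;  /* reference to x */
    node[2].data = (Word)y_ref;  node[2].full = 1;  /* reference to y */
    return node;
}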

The discussion of user data structures in this dissertation is limited to algebraic types. Arrays are also available in pH, and they are implemented as abstract types using the basic I- and M-Structures we have presented. We omit arrays from the presentation because they do not require any additional compilation machinery.

5.1.3 Activation Frames

The remaining sections describe data structures that are not visible directly in pH or λ* programs. They are implicit in the high-level representation of the program, but they must be defined explicitly at the SMT level. The most important of these is the activation frame.

An activation frame is created for each invocation of a function. It provides space for parameters, local variables, tasks, and barriers. An activation frame for a function can be deallocated when all the tasks within the body of the function have terminated.

Activation frames in λ* are managed differently than in most languages. In C, for example, a simple stack, separate from the heap, can be used to allocate and deallocate activation frames because the language is strict and sequential. Callers and callees execute in a LIFO order since a callee always terminates before its caller. In pH and most other parallel languages, function calls constitute the coarsest-grain parallelism in a program. A function and its callees can run in parallel, and callees can terminate out of order. A stack would prevent callee frames from being deallocated until all parallel calls have completed, and this might prove to be extremely inefficient. Instead, what is commonly done is to allocate frames on the heap, just like any other data structure, and to link them into a general activation frame tree that follows the call tree of the program.

A pH activation frame is divided into five areas, though frames for simple procedures might include only a subset of these:

1. The caller link area contains information about the caller of a procedure: a pointer to the caller's activation frame, the return address into the caller's code, and a pointer to the calling task structure.

2. The arguments area contains references to the arguments of the procedure. The storage for the argument values is created earlier in the call tree. Arguments are passed as references to maintain non-strictness since they allow a function to be invoked, to execute, and to return even if no arguments have been computed.

3. The locals area contains references to local variables. The memory cells where the values of local variables are stored are allocated separately from the frame because the lifetime of this storage may outlive the frame. How? A procedure can pass a reference to a local variable as an argument to a child procedure, but due to non-strictness, the child procedure may outlive the parent. Thus, the memory cell the reference points to cannot be deallocated along with the caller's frame.

4. The barrier pointers area contains pointers to the barrier data structures required by the procedure. Barriers are allocated separately at the time of function invocation. In theory, barriers could be allocated directly as part of the activation frame since the lifetime of the barriers exactly matches the lifetime of the frame. In practice, it is problematic in a garbage-collected environment to deal with pointers into the interior of objects.

5. The task pointers area contains pointers to the task structures for the procedure. Remember that task structures are used to build up defer lists when tasks suspend on empty memory cells. The global work queue also contains task structures for tasks that are ready to execute. Like barriers, the space for task structures could be allocated directly as part of the frame since the lifetimes of the structures match that of the frame, but for the same reason, to avoid pointers into the interior of objects, we allocate them separately at the time of function invocation.

Activation Frame Data Structure

Caller Link    full   Pointer to caller's activation frame
               full   Return address
               full   Pointer to caller's task structure
Argument       full   Argument 1
References     ...
               full   Argument a
Local          full   Local variable 1
References     ...
               full   Local variable l
Barrier        full   Barrier 1
Pointers       ...
               full   Barrier b
Task           full   Task 1
Pointers       ...
               full   Task t

5.1.4 Closures

Closures are data structures used to represent functions. They are created directly using λ-syntax or via currying. In memory, a closure consists of at least three slots: one to hold the arity of the closure (the number of arguments required for invocation), one to hold the address of the entry point of the code, and one to hold the number of curried arguments stored in the closure. The rest of the closure contains the curried arguments. In the following examples, each expression results in the closure shown below it:

f = λ a b c. E_body

arity            full   3
entry point      full   a_body
argument count   full   0

f x y

arity            full   1
entry point      full   a_body
argument count   full   2
argument         full   Reference to x
argument         full   Reference to y

The first expression and its closure represent a function f of arity 3, where the address of the entry point is a_body. The second expression is a curried application of f to two arguments. A new closure is created as the value of the expression. The arity field of the new closure is computed by subtracting the number of arguments (2) from the original arity (3), the entry point remains unchanged, and references to the arguments (in left-to-right order) are placed in the subsequent slots. This new closure only needs one more argument in order to invoke the body of the function. The original closure still needs three.
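The arity arithmetic can be sketched in C. The layout below mirrors the diagrams above (arity, entry point, stored-argument count, then argument references); the struct and function names are our own, not pHc's:

#include <stdlib.h>
#include <string.h>

typedef struct Cell Cell;   /* reference = pointer to a presence/data cell */

typedef struct Closure {
    long  arity;     /* arguments still needed to invoke the body */
    void *entry;     /* code address of the function body */
    long  nstored;   /* curried arguments already captured */
    Cell *args[];    /* references, in left-to-right order */
} Closure;

/* Curry an existing closure with k further argument references. */
Closure *curry(const Closure *f, Cell *const newargs[], long k) {
    Closure *c = malloc(sizeof *c + (f->nstored + k) * sizeof(Cell *));
    c->arity   = f->arity - k;      /* 3 - 2 = 1 in the example above */
    c->entry   = f->entry;
    c->nstored = f->nstored + k;
    memcpy(c->args, f->args, f->nstored * sizeof(Cell *));
    memcpy(c->args + f->nstored, newargs, k * sizeof(Cell *));
    return c;
}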

5.2 Compiling Simple, Primitive, and Data Expressions

This section describes how each λ* expression is compiled into SMT instructions. The compilation rules are given in terms of two translation functions: TE (Translate Expression) and TS (Translate Statement). A third translation function, TP (Translate Program), is given in Section 5.6. The TP function drives the compilation process in a syntax-directed fashion, based on the syntax of λ* in Figure 2-5. The input to a translation function is a syntactic program element: an expression for TE, a statement for TS, and an entire λ* program for TP. The output of a translation function is a list of SMT instructions. These lists of instructions are appended to form longer lists for larger syntactic elements, and in fact, the final result of the compilation process is a single list of SMT instructions.

The translation functions are specified in an informal pseudo-code that borrows from the syntax of pH. For example, we borrow "[ ]" for list notation and ++ for list concatenation.

In addition, we give an example of the code generated for each translation function in the following format:

Expression

SMT Code for Expression

The compilation scheme sets aside several GP registers for special purposes. These registers do not conflict with any other registers, and they are given the following names:

rV    The value register. The instructions generated for an SMT expression, when executed, always place the value of the expression (never a reference) in rV.

rF    The frame pointer register. Contains the address of the activation frame of the procedure currently executing on the processor.

rB    The current barrier register. Contains the address of the barrier structure corresponding to the current pre-region.

rL    The link register. Used to store return addresses on procedure calls.

rS    The scratch memory register. Points to a private per-processor area of scratch memory that can be managed without the use of the allocate instruction. This register is initialized before code starts executing on a processor.

rSS   The scratch memory size register. Contains an integer indicating the number of scratch memory locations in use.

Though not apparent in the syntax, identifiers in a λ* program are either local or top-level. Local identifiers are those defined within a procedure while top-level identifiers are defined outside the scope of any procedure. As we'll see in Section 5.6, a program consists of a group of functions bound to top-level identifiers. The translation functions make use of identifiers to represent the addresses of references to values. For a local identifier, the reference is stored in the procedure activation frame. For a top-level identifier, the reference is a statically generated global data structure. A pass in the compiler following SMT code generation converts symbolic local identifiers into integer offsets in the activation frame.

For a top-level identifier, the symbol is resolved into an address at program link-time.

5.2.1 Constants

A constant is the simplest λ* expression to compile to SMT. We simply move the value of the constant into register rV.

TE[c] = [ move(rV, c) ]

The following code results for the compilation of the expression "1":

1

move(rV, 1)

5.2.2 Identifiers

To fetch the value of an identifier, we must make sure that the value has been computed, using a touch instruction. Remember that the symbol for the identifier represents the address of a reference to the identifier. This is the rule to generate code for an identifier expression:

TE[x] = [ touch(@x) {rF, rB}, load(rV, @x) ]

The touch instruction in this sequence tests the presence status of the memory cell for x. When the touch succeeds, the value is guaranteed to be present, so the load fetches it into register rV. Notice how the use of indirect addressing keeps the code sequence very compact. Also, notice the live-state annotation on the touch instruction. This particular annotation with rF and rB is so common that, to avoid clutter, we'll abbreviate it with *.

Here is the compilation of the expression "x". We assume that symbolic identifiers correspond to local variables and that they are converted into activation frame offsets in some later pass in the compiler.

x

touch(@rF[x]) *
load(rV, @rF[x])

5.2.3 Primitive Functions

Primitive functions represent machine operations (arithmetic, logic, etc.), not user-defined λ-abstractions. Unlike user functions, primitive functions are strict in all arguments. This means that all arguments must be computed before the primitive is executed. The compiler implements this behavior using touch instructions. The rule for compiling primitives, where PF^n is a primitive of arity n, is:

TE[PF^n(x1, ..., xn)] = [ touch(@x1) *, ..., touch(@xn) *,
                          load(r1, @x1), ..., load(rn, @xn),
                          pf^n(rV, r1, ..., rn) ]

The following code results from compiling a binary integer addition expression. Identifiers x and y are local variables within the frame.

x + y

touch(@rF[x]) *
touch(@rF[y]) *
load(r1, @rF[x])
load(r2, @rF[y])
plus(rV, r1, r2)

The code is structured so all the touch instructions execute first, before the value of any identifier is loaded into a register. This approach limits the amount of live state that accumulates in the registers, so that we don't have to preserve it across a suspension. After the touch instructions execute, it is certain that x and y have been computed, so we simply load their values from the frame, execute the plus primitive, and deposit the result in rV.
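Since pHc ultimately emits C, the flavor of the generated code for x + y can be suggested by the hand-written fragment below. This is a hypothetical rendering, not the compiler's actual output: TOUCH spins where the real code would save live state and suspend, and rF_x, rF_y stand for frame slots:

typedef struct Cell { int full; long data; } Cell;

/* Stands in for the suspend-or-continue logic of Section 4.5. */
#define TOUCH(ref) do { while (!(ref)->full) ; } while (0)

long compiled_plus(Cell *rF_x, Cell *rF_y) {
    long r1, r2, rV;
    TOUCH(rF_x);          /* touch(@rF[x]) *  */
    TOUCH(rF_y);          /* touch(@rF[y]) *  */
    r1 = rF_x->data;      /* load(r1, @rF[x]) */
    r2 = rF_y->data;      /* load(r2, @rF[y]) */
    rV = r1 + r2;         /* plus(rV, r1, r2) */
    return rV;
}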

5.2.4 Memory Expressions: allocate, iFetch, and mFetch

In the λ* syntax, allocate, iFetch, and mFetch are treated as separate expressions, not lumped with the primitive functions, because their behavior depends on side-effects in the program. The argument to allocate is the number of memory cells to allocate. The result is the address of a fresh block of memory. This expression maps almost directly to the SMT instruction we have defined for this purpose:

TE[allocate(x)] = [ touch(@x) *, load(rV, @x), allocate(rV, rV) ]

As expected, we access the value of the identifier, the number of cells to allocate, with a touch/load combination.

The argument to an iFetch or mFetch expression is an identifier whose value is the address of a memory cell. The purpose of these expressions is to extract the value of the cell, so they must synchronize twice. The first synchronization ensures that the value of the identifier has been computed (i.e., that the memory cell has been allocated). The second ensures that the contents of the memory cell have been computed. The following rule achieves this effect for iFetch:

TE[iFetch(x)] = [ touch(@x) *, load(rV, @x), touch(rV) {rV, rF, rB}, load(rV, rV) ]

Notice that the second touch makes sure to preserve rV (as well as the usual rB and rF). This allows the contents of the register to be used in the subsequent instructions and avoids the recalculation of the address in rV. For mFetch, only the last two instructions change. They are replaced by a single take(rV, rV). To implement the M-Structure protocol properly, take combines a synchronization operation and a memory load.

An example of the code generated by this rule follows:

iFetch(x)

touch(@rF[x]) *
load(rV, @rF[x])
touch(rV) {rV, rF, rB}
load(rV, rV)

5.2.5 Constructors

Constructors are non-strict with respect to their contents. A constructor in memory consists of as many slots as it has fields, plus one extra slot: the tag used to distinguish one constructor from another. The tag is simply an integer and is computed when the constructor is created. Its value depends on the static structure of the constructor's type. For example, the type Tree from Section 2.2,

data Tree a = Leaf a | Node (Tree a) (Tree a)

contains the constructors Leaf and Node. The compiler assigns a tag of 0 to the Leaf and a tag of 1 to the Node. The tag slot is considered the 0th field of the constructor.

The rule to compile a constructor with k fields is the following:

TE[CN^k(x1, ..., xk)] = [ allocate(rV, k + 1), store(rV[0], tag),
                          load(r1, x1), store(rV[1], r1), ...,
                          load(r1, xk), store(rV[k], r1) ]

The code generated for a constructor expression does not include any synchronizing (suspensive) instructions like touch, so it is certainly non-strict. The size argument to the allocate instruction is k + 1 to accommodate the extra slot required for the tag of the constructor. Also, note that references, not values, are stored in the body of the constructor. We cannot store values because it would make the computation of the constructor dependent on its contents. It is important to note that unlike function applications, constructor applications are not curried.

Here is the code that is generated for Node, a constructor with two fields:

Node (x, y)

allocate(rV, 3)
store(rV[0], 1)
load(r1, rF[x])
store(rV[1], r1)
load(r1, rF[y])
store(rV[2], r1)

5.2.6 Projections

The primitive projection function Prj_k is used to extract the kth field of a constructor. Like iFetch and mFetch, this implies that projections must synchronize twice: once to make sure that the constructor has been computed and once to make sure that the field of the constructor has been computed. The only exception is when fetching the tag of the constructor (the 0th field). The tag is always known to be available because it is computed when the constructor is allocated. We give these two cases of the projection rule separately.

For the tag field:

TE[Prj_0(x)] = [ touch(@x) *, load(rV, @x), load(rV, rV[0]) ]

For data fields, where k > 0:

TE[Prj_k(x)] = [ touch(@x) *, load(rV, @x), touch(@rV[k]) {rV, rF, rB}, load(rV, @rV[k]) ]

Here is an example of a projection on the first data field:

Prj_1(x)

touch(@rF[x]) *
load(rV, @rF[x])
touch(@rV[1]) {rV, rF, rB}
load(rV, @rV[1])

5.3 Compiling Case Expressions

Unlike the previous expressions, Case requires the implementation of control-flow. The first argument of a Case is an integer expression (the discriminant) that determines which branch to execute. The rest of the arguments are the expressions for the branches. In the SMT code, we represent code addresses symbolically with labels. As in most machine languages, a label can appear in an instruction sequence or as a parameter to an instruction.

Labels are eventually resolved into addresses when we generate an executable program. We generate fresh labels in the compiler with the genLabel routine.

In the rule below, the expressions for the branches are compiled using recursive calls to TE. Labels are added to the beginning of each branch. A jump to the end of the Case code sequence is added to all but the last branch since control can flow directly from the last branch to the end of the code sequence. Then, the discriminant code and the code for all the branches are concatenated into a single instruction sequence.

TE[Case(x, E0, ..., Ek)] =
  { ldone, l0, ..., lk = genLabel, genLabel, ..., genLabel
    is0 = [ l0 : ] ++ TE[E0] ++ [ jump(Always, ldone) ]
    is1 = [ l1 : ] ++ TE[E1] ++ [ jump(Always, ldone) ]
    ...
    isk = [ lk : ] ++ TE[Ek]
  in [ touch(@x) *, load(rV, @x), switch(rV, l0, ..., lk) ]
     ++ is0 ++ is1 ++ ... ++ isk ++ [ ldone : ] }

To keep the example simple, the discriminant in the following Case expression chooses between only two expressions, y or 151.

Case(x, y, 151)

touch(@rF[x]) *
load(rV, @rF[x])
switch(rV, l0, l1)
l0: touch(@rF[y]) *
    load(rV, @rF[y])
    jump(Always, ldone)
l1: move(rV, 151)
ldone:

The sequence is structured so that when control flow reaches label ldone, the value of the Case expression will be in register rV.

5.4 Compiling λ-Abstractions and Function Calls

In this section, we focus on λ-abstractions and function applications exclusively. The two compilation rules are closely related since they generate code for the two halves of the function call. The rule for λ-abstractions handles the callee side of the calling convention while the rule for application expressions handles the caller's responsibilities.

5.4.1 λ-Abstractions

The code generated for λ-abstractions is complex because the calling convention must include full support for currying. Due to currying, a function of n arguments may eventually execute as the result of an application of anywhere from n arguments down to a single argument. For efficiency, we also want to support passing a limited number of arguments in registers, but we must support function calls that exceed that limit. We divide the presentation below into three parts. First, we present the calling convention for pH functions. Then we present the rule for generating code based on this convention. Finally, we present the code generated for a particular λ-expression.

Calling Conventions

We use the following registers to pass information from caller to callee:

r1 ... r8   Argument registers (GP registers)

rV Pointer to callee closure

rL Caller return address

rF Caller frame pointer

rT Caller task structure

rS Pointer to overflow area

rSS   Number of overflow arguments

The choice of eight registers for the arguments is common in the calling conventions of real systems. All registers in the SMT register files are considered caller-saved by the compilation strategy. Code in the caller must explicitly save and restore the values of registers that are live across procedure calls. This convention is essential due to the possibility of suspension in a callee. If a callee suspends, there is no guarantee that the next task to be scheduled will be in the caller function.

Arguments are placed in registers in an interesting way to simplify the structure of the code. Currying proceeds from left to right, so arguments at a call site are placed in registers in right-to-left order beginning at register r1. This guarantees that the right-most argument, the last argument needed to invoke the body of the function, will always appear in r1. Recall the function f in Section 5.1.4:

f = λ a b c. E_body

If this function is invoked as "f x y z", then r1 will contain a reference to z, r2 a reference to y, and r3 a reference to x.

If there are more than eight arguments in an application, the excess arguments are placed in the scratch memory of the processor, accessed via rS. Given the way arguments are assigned to registers, all arguments except the right-most eight will be passed in memory.

Notice that we must take into account the possibility of currying in all cases. Suppose a closure is created from a function of arity 12 by currying it with 2 arguments. Later the resulting arity 10 closure is applied to 10 arguments. When the body of the function is invoked, its arguments appear in three different places. The first and second arguments (counting from the left) appear in the closure data structure; the next two arguments are in the scratch memory; and the last eight arguments are in the argument registers.

Code Generation

The rule to generate code for lambda abstractions follows. To understand its behavior, it is important to keep in mind the calling convention just presented and the structure of activation frames from Section 5.1.3. The compilation rule is divided into three sections:

1. The first section generates the code to allocate and initialize the activation frame for the function. Recall that frames are divided into five areas: caller link, arguments, local references, barrier pointers, and task pointers. Each of these areas is initialized separately, and this is what consumes the majority of the rule. References for arguments, for instance, are copied from registers, scratch memory, or the incoming closure into the frame. Locations for local variables are allocated independently, and references to these locations are stored into the frame. Likewise, space for barriers and tasks is allocated and pointers to these structures are stored in the frame.

2. The second section generates code for the body of the function. This is particularly easy since it only involves a recursive invocation of the TE rule.

3. The last section links the frame setup code, the function body code, and a small code fragment to return control to the caller into a single code sequence. This code sequence constitutes the full code for the lambda abstraction.
TE[λ x1 ... xn . E_body] =
  > The entire code for the function.
  { fullCode = setupFrame ++ body ++ return

    > Generate code to initialize the activation frame.
    setupFrame = callerLink ++ argsMemRegs ++ argsClosure ++ setupStructures
    callerLink = [ allocate(r9, fsize), store(r9[0], rF), move(rF, r9),
                   store(rF[1], rL), store(rF[2], rT) ]

    > Get the arguments from scratch memory and registers.
    argsMemRegs = testArity ++ argCode_n ++ ... ++ argCode_1
    arg_1, ..., arg_n = genLabel, ..., genLabel
    testArity = [ load(r9, rV[0]), switch(r9, err, arg_1, ..., arg_n) ]
    argCode_n   = [ arg_n :,   load(r10, rS[n - 8]), store(rF[3], r10) ]
    argCode_n-1 = [ arg_n-1 :, load(r10, rS[n - 9]), store(rF[4], r10) ]
    ...
    argCode_9 = [ arg_9 :, load(r10, rS[1]), store(rF[n - 9 + 3], r10) ]
    argCode_8 = [ arg_8 :, store(rF[n - 8 + 3], r8) ]
    ...
    argCode_1 = [ arg_1 :, store(rF[n - 1 + 3], r1) ]

    > Get the arguments from the closure.
    argsClosure = testArity2 ++ clCode_n-1 ++ ... ++ clCode_0
    cl_0, ..., cl_n-1 = genLabel, ..., genLabel
    testArity2 = [ switch(r9, err, cl_n-1, ..., cl_0) ]
    clCode_n-1 = [ cl_n-1 :, load(r9, rV[2]), store(rF[3], r9) ]
    clCode_n-2 = [ cl_n-2 :, load(r9, rV[3]), store(rF[4], r9) ]
    ...
    clCode_1 = [ cl_1 :, load(r9, rV[n]), store(rF[n + 1], r9) ]
    clCode_0 = [ cl_0 : ]

    > Initialize additional structures: local variables, barriers, tasks.
    setupStructures = makeLocals ++ makeBarriers ++ makeTasks
    makeLocals = [ allocate(r9, locals), store(rF[n + 3], r9), plus(r9, r9, 1),
                   store(rF[n + 4], r9), ..., store(rF[n + locals + 2], r9) ]
    makeBarriers = [ allocate(r9, barriers * 3), store(rF[n + locals + 3], r9),
                     plus(r9, r9, 3), store(rF[n + locals + 4], r9), ...,
                     store(rF[n + locals + barriers + 2], r9) ]
    makeTasks = [ allocate(rT, sizetasks), store(rF[n + locals + barriers + 3], rT),
                  plus(r9, rT, sizetask_1), store(rF[n + locals + barriers + 4], r9), ...,
                  plus(r9, r9, sizetask_(tasks-1)),
                  store(rF[n + locals + barriers + tasks + 2], r9) ]

    > Generate code for the body of the function.
    body = TE[E_body]

    > Generate code to return from the function.
    return = [ load(rT, rF[2]), load(rL, rF[1]), load(rF, rF[0]), jump(Always, rL) ]
  in fullCode }

where n          = number of arguments of the λ-expression
      locals     = number of local variables in E_body
      barriers   = number of barriers in E_body
      tasks      = number of tasks in E_body
      fsize      = size of activation frame, n + locals + barriers + tasks + 3
      sizetask_i = size of structure for task_i
      sizetasks  = Σ_i sizetask_i
      err        = standard error handling code (not shown)
The rule shows how to generate code for the most general case, where n > 8 and the function body contains several tasks and barriers. Obvious optimizations are made when, for example, there is only one argument: the switch instructions are unnecessary. The values of n, locals, etc., are all constants and are computed by analyzing the λ-expression.

Example

The example presented below is more straightforward than the most general case of a λ-abstraction because it consists of only two arguments, no local variables, no barriers, and a single task structure. We present the code in fragments following the structure of the compilation rule. The first fragment corresponds to the callerLink code. This code is used to allocate the activation frame and to save three pieces of information from the caller: the caller's frame pointer in register rF, the return address in register rL, and the caller's task structure pointer in rT.

λ a b. a + b

allocate(r9, 6)
store(r9[0], rF)
move(rF, r9)
store(rF[1], rL)
store(rF[2], rT)

The allocate instruction allocates free space for the frame on the heap. The size of the frame for this function is 6 because it consists of three slots for the caller link information (caller frame pointer, return address, and task structure), two slots for the references to a and b, and one slot for a pointer to a single task structure. Notice that we use register r9 as a temporary register since no parameter is ever passed in that register. The code stores the caller's frame pointer into the new activation frame and updates the value of the frame pointer register rF with the pointer to the new frame. Subsequent accesses to the new frame are made through rF.

The next sequence of instructions transfers the argument references from the argument registers and closure into the frame. It corresponds to the code fragments argsMemRegs and argsClosure in the compilation rule. The code is a little tricky here due to currying. As we noted, the code for the function might have been invoked as the result of an application to one or two arguments, so the arguments can be distributed between the registers and the closure in rV in two ways. If the function is invoked with two arguments directly, then all arguments will be in registers and none will be in the closure. If the function is curried over one argument and then applied to another argument later, then one argument will be in the registers, and the other in the closure.

The piece of information that describes how the arguments are organized is the arity field of the closure in rV. Located at offset 0 in the closure, the arity can range in value from 1, for a maximally curried closure, to n, the arity of an uncurried closure. We use this value as follows:

load(r9, rV[0])
switch(r9, err, arg_1, arg_2)
arg_2: store(rF[a], r2)
arg_1: store(rF[b], r1)
       switch(r9, err, cl_1, cl_0)
cl_1:  load(r9, rV[3])
       store(rF[a], r9)
cl_0:

The first switch uses the arity to determine how many arguments are in registers. If the arity of the closure is 1, for instance, then control passes to label arg_1, and the value of register r1 is moved into the activation frame. The second switch uses the arity to determine how many arguments are in the closure. Again, if the arity is 1, then control passes to label cl_1 since only one argument needs to be moved from the closure to the frame. Notice how the code is organized so that a transfer of control to one of the earlier switch alternatives causes the code for all the succeeding alternatives to be executed. In this way, we can make the code more compact since the case for extracting seven arguments from registers also makes use of the code for extracting six, five, and so on. Often, the behavior of a function is simple (leaf and non-suspensive), so arguments can remain in registers throughout an invocation. To keep the code generator simple, we leave optimizations like this to other passes of the compiler.
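This fall-through organization has a direct C analogue: a switch whose cases deliberately do not break, so entering at one case executes it and every later one. The sketch below mirrors the two-argument example; the function and parameter names are invented for illustration:

typedef struct Cell Cell;

/* Unpack one or two argument references into frame slots a and b.
   Entering at case 2 also executes case 1, mirroring the SMT labels. */
void unpack_args(long arity, Cell *r1, Cell *r2, Cell **slot_a, Cell **slot_b) {
    switch (arity) {
    case 2:
        *slot_a = r2;      /* arg_2: store(rF[a], r2) */
        /* fall through */
    case 1:
        *slot_b = r1;      /* arg_1: store(rF[b], r1) */
        break;
    default:
        ;                  /* err: arity out of range */
    }
}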

The next piece of code (the makeLocals code fragment in the compilation rule) constructs the area of the activation frame for local variables. First it allocates enough storage for all the locals, and then it puts references to these individual memory cells in the frame. Our example has no local variables, so we skip this code. We also skip the makeBarriers code since our example has no barriers. Finally, we reach the code that initializes the pointers to the task structures. This code corresponds to the makeTasks fragment in the compilation rule. The λ-expression in our example only needs a single task structure, to be used when the code synchronizes with the values of a and b in the computation of the primitive function application a + b. This task structure needs to store only three pieces of information: a program counter, the value of rF, and the value of rB. Only two instructions are required to allocate the task structure and to link it to the activation frame:

allocate(rT, 3)
store(rF[task1], rT)

Notice that as a side-effect of the allocation, a pointer to the first task structure is placed in rT. This ensures that rT is initialized by the time the code for the body of the function starts executing.

The code for the body of the function, a + b, follows the frame initialization code. It's simply a primitive application:

touch(@rF[a]) *
touch(@rF[b]) *
load(r1, @rF[a])
load(r2, @rF[b])
plus(rV, r1, r2)

After the body produces a value in register rV, the function returns to its caller with the return sequence of instructions:

load(rT, rF[2])
load(rL, rF[1])
load(rF, rF[0])
jump(Always, rL)

The first instruction restores the caller's task structure. The second loads the return address into rL. The third restores the caller's frame pointer, and finally, the last jumps back into the caller's code.

Lambda-Lifting

The syntax of λ* allows functions to be bound to identifiers at any level of the lexical hierarchy. As discussed in Section 3.2, the front-end of the pHc compiler performs a lambda-lifting transformation [28], so that λ-abstractions are only bound to identifiers at the top-level of a program. This means that curried applications are the only expressions that result in the creation of closures at run-time. Lambda-lifting also simplifies the management of lexical scopes because free variables are passed as explicit function arguments. We discuss function application in the next section.

Summary

The code for a λ-abstraction needs to support the language features of currying and parallelism. Due to currying, a function may be invoked in many different ways. In some cases, all arguments may be in registers while in others, arguments might be spread out over a closure, the SMT processor scratch memory, and registers. Due to parallelism, a function must allocate and create links to storage separate from the activation frame for local variables, task structures, and barriers. Despite the complexity of pH function calls compared to other languages, like C, there are opportunities for optimization. For example, variables that are not live across suspensions might always be stored in registers, and task structures might be shared by computations that are disjoint, like the branches of a conditional. In the code generator, however, we use a compilation rule that handles the most general case of a λ-abstraction and worry about optimizations separately.

5.4.2 Function Applications

The code for a λ-abstraction performs much of the work needed for a function call, but higher-order functions and currying still make it tricky to compile all but the simplest applications. An application expression consists of a function identifier and a number of argument identifiers. As mentioned earlier, there are two types of function identifiers: top-level and local. Both kinds of identifiers are bound to closures, but the contents of the closure of a top-level identifier are known at compile-time (they are generated by the TP rule in Section 5.6). This is important because the code to compile an application expression needs the arity and entry-point information from the closure.

Application expressions also differ depending on the number of arguments present at the call site. There are three cases. If the number of arguments at the call site is less than the arity of the function being applied, then the application is a curried application and results in the creation of a closure. If the number of arguments is exactly equal to the arity of the function, then the application is a full application, and the function is invoked.

Last, if the number of arguments at the call site exceeds the arity of the function, then the application is a compound application. In a compound application, an unknown number of intermediate closures will be created and immediately applied, incrementally consuming the arguments at the call site. We discuss compound applications in more detail later in this section.

The code generated by the compiler to perform a function application proceeds in three steps. The first step is to determine the arity of the function being applied. Based on the arity, the compiler can determine whether the application is curried, full, or compound. If the function being applied is a top-level function, the arity can be determined statically, and the compiler can generate a curried, full, or compound calling sequence directly. However, if the function is not a top-level function, the arity cannot be known statically. In this case, the compiler must generate code to determine the arity at run-time using the first slot of the closure for the function. In addition, all three different code sequences to handle the curried, full, and compound application cases must be emitted. The compiler generates code that uses the arity from the closure to dispatch to the appropriate sequence dynamically.

The next step in the calling sequence is to package the arguments. In the case of a full application, the calling conventions described in the compilation of λ-abstractions (Section 5.4.1) describe exactly how to pass parameters. Briefly, the rightmost eight parameters are passed in registers r1 to r8, and the rest are passed in scratch memory. If the application is a curried application, and the callee is top-level, then code to create a new closure is emitted directly. If the callee is not top-level, then the compiler emits code that invokes an auxiliary run-time system routine called makeClosureRTS. This RTS routine expects the current closure in register rV, a return address in rL, and the arguments of the application in scratch memory. It returns to address rL with a freshly allocated closure in rV.

A similar strategy is used for all compound applications. The compiler delegates the work of performing the compound application to the RTS routine compoundCallRTS. Like makeClosureRTS, this routine expects the current closure in rV, a return address in rL, and the arguments at the call site in scratch memory. Curried and compound applications are much less common than full applications, so we expect that the cost of invoking the RTS will not affect performance significantly. This is a small price to pay for substantially reducing the amount of code at a call site.

The final step in the calling sequence is transferring control to the callee. In a full application of a top-level callee, the name of the entry point is known statically, so the compiler emits a jump instruction with the correct symbolic address. In a full application of a non-top-level callee, the entry point must be extracted from the closure for the callee and the compiler emits a jump instruction that performs an indirect control transfer.

The following compilation rule handles all six different kinds (the product of curried, full, and compound with top-level and local callees) of applications. It uses the compiler routines topLevel and arity to discover static properties of the callee.

TE[f x1 ... xn] =
  if (topLevel(f))
  > This branch emits code when the callee is known statically.
  then if (n < arity(f))
       then [ allocate(rV, n + 3), store(rV[0], arity(f) - n), store(rV[1], f-entry-label),
              store(rV[2], n), load(r1, x1), store(rV[3], r1), ..., load(r1, xn), store(rV[n + 2], r1) ]
       else if ((n == arity(f)) and (n <= 8))
       then [ load(rV, f), load(r1, xn), ..., load(rn, x1), jump(Always, f-entry-point, rL) ]
       else if ((n == arity(f)) and (n > 8))
       then [ load(rV, f), load(r1, x1), store(rS[0], r1), ..., load(r1, xn-8), store(rS[n - 8 - 1], r1),
              load(r1, xn), ..., load(r8, xn-7), jump(Always, f-entry-point, rL) ]
       else [ load(rV, f), load(r1, x1), store(rS[0], r1), ..., load(r1, xn), store(rS[n - 1], r1),
              move(rSS, n), jump(Always, compoundCallRTS, rL) ]
  > This branch emits code when the callee is not known statically.
  > It emits code for all three kinds of applications and selects one at run-time based on the arity.
  else { fullLabel, doneLabel = genLabel, genLabel
         setupParams = if (n <= 8)
                       then [ load(r1, xn), ..., load(rn, x1) ]
                       else [ load(r1, x1), store(rS[0], r1), ..., load(r1, xn-8), store(rS[n - 8 - 1], r1),
                              load(r1, xn), ..., load(r8, xn-7), move(rSS, n) ]
         fullCode = [ fullLabel: ] ++ setupParams ++ [ jump(Always, @rV[1], rL), doneLabel: ]
       in
         [ touch(@f) *, load(rV, @f), load(r1, rV[0]), jump((r1 == n), fullLabel),
           load(r2, x1), store(rS[0], r2), ..., load(r2, xn), store(rS[n - 1], r2),
           move(rSS, n), move(rL, doneLabel), jump((r1 > n), makeClosureRTS),
           jump(Always, compoundCallRTS) ] ++ fullCode }
  where n           = number of arguments at the call site
        arity(f)    = compiler function giving the arity of f
        topLevel(f) = compiler predicate indicating whether f is known statically
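The three-way dispatch at the heart of this rule can be summarized compactly. The following Haskell fragment is purely illustrative: the compiler performs this comparison on λ* terms at compile-time, and the emitted code performs it on the arity slot of the closure at run-time.

    data AppKind = Curried | Full | Compound
      deriving Show

    -- arity: the arity of the callee; n: the number of arguments at the call site.
    classify :: Int -> Int -> AppKind
    classify arity n
      | n < arity  = Curried   -- too few arguments: build a new closure
      | n == arity = Full      -- exact match: jump to the entry point
      | otherwise  = Compound  -- too many arguments: delegate to the RTS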

Of the six varieties, curried and full applications involving top-level function identifiers are the simplest to compile. The following examples are based on a top-level function f of two arguments. The compiler knows the arity of f statically, so the following curried application generates a closure directly. As usual, we assume x is a local identifier:

f x

    allocate(rV, 4)
    store(rV[0], 1)
    store(rV[1], f-entry-point)
    store(rV[2], 1)
    load(r1, rF[x])
    store(rV[3], r1)

This code corresponds to the code generated by the compilation rule when the conditions topLevel(f) and n < arity(f) are both true. The allocate instruction creates a new closure of four slots: arity, entry point, curried-argument count, and a reference to x. The rest of the instructions initialize the closure. It is important to note that because closures are non-strict, the parameter x is represented with a reference, not its value.

When the application expression corresponds to a known, full-arity application, that is, when topLevel(f) and n == arity(f) are true, the following code is emitted:

f x y

    load(rV, f)
    load(r1, rF[y])
    load(r2, rF[x])
    jump(Always, f-entry-point, rL)

The first load instruction is used to get a pointer to the closure for f. The closure for f is known to be computed (it is emitted as a static data block by the TP rule), so there is no need for a touch instruction. The next two load instructions initialize the argument registers. Since procedure calls are non-strict, only references are passed in the argument registers. Hence, there is no indirection in the addressing. Finally, the jump instruction transfers control directly to the label f-entry-point, while storing the return address in rL. The name of this entry point label can be derived, by convention, from the name of the function f. As we shall see, the entry point is generated by the TP rule. The final case to consider for top-level functions is a compound application. We discuss this case for both top-level and local function identifiers in a separate section below.

When the arity of a function is not known at compile-time, the code must determine it dynamically. This is the code generated for an application of z, a local variable, to two arguments:

z x y

    touch(@rF[z]) *
    load(rV, @rF[z])
    load(r1, rV[0])
    jump((r1 == 2), lfull)
    load(r2, rF[x])
    store(rS[0], r2)
    load(r2, rF[y])
    store(rS[1], r2)
    move(rSS, 2)
    move(rL, ldone)
    jump((r1 > 2), makeClosureRTS)
    jump(Always, compoundCallRTS)
lfull:
    load(r1, rF[y])
    load(r2, rF[x])
    jump(Always, @rV[1], rL)
ldone:

First, the code makes sure that the value of z, a closure, has been computed. An application is strict only in the function being applied. Then, the code loads the value of the arity slot of the closure into r1. This value is compared to the number of arguments in the application expression (2). If the arity of the closure equals the number of arguments, then the application is a full application. The code jumps to the lfull label, loads the registers with references to x and y according to the calling convention, and jumps to the entry point of the function, extracted from offset 1 in the closure.

If the arity of the closure is greater than the number of arguments in the application expression, the application is a curried application and a new closure must be created.

Creating this new closure is more complicated than in the case of a top-level function because the size of the existing closure is not known statically. The process involves additional code to compute the new size, copy the old closure, and so on. We use the RTS to perform this operation. The makeClosureRTS routine expects the current closure in rV and the application arguments in the scratch memory. It returns a new closure in rV to the address in rL, which the code sets to ldone.
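The work makeClosureRTS performs can be sketched in Haskell, with closures modeled as values. The field and function names here are illustrative; the real routine copies a flat heap record whose size is discovered at run-time.

    -- Slot 0: remaining arity; slot 1: entry point; slot 2: curried-argument
    -- count; remaining slots: references to the curried arguments.
    data Closure = Closure
      { remainingArity :: Int
      , entryPoint     :: Int    -- a code address, modeled as an Int
      , curriedArgs    :: [Int]  -- argument references, modeled as Ints
      }

    -- Copy the old closure, append the new argument references, and decrease
    -- the remaining arity by the number of arguments consumed.
    makeClosure :: Closure -> [Int] -> Closure
    makeClosure c newArgs = c
      { remainingArity = remainingArity c - length newArgs
      , curriedArgs    = curriedArgs c ++ newArgs
      }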

The last case to consider is that the arity of the closure is less than the number of arguments at the call site. We handle this case with the RTS primitive compoundCallRTS.

Compound Applications

A compound application arises when the number of arguments at a call site is greater than the arity of the function being applied. In this section, we describe such compound applications, and we sketch the behavior of the RTS routine used to handle them, compoundCallRTS. It might seem incorrect to provide more arguments to a function than it can accept, but it can be perfectly reasonable and type-correct in a higher-order language. Suppose a function with the type signature Int -> Int -> Int -> Int is passed as an argument to the following function:

λ g . g 1 2 3

The compiler can determine that the application g 1 2 3 is type-correct, but it cannot determine the kind of function passed for g. It could be any one of g1, g2, g3, or g4 from Section 3.2. So the kind of function being applied must be determined repeatedly, at run-time. Given a set of arguments and an initial closure, the core of compoundCallRTS is a loop that checks the arity of a closure, calls it with an equal number of arguments from the arguments at the call site, and repeats the process with the resulting closure. The routine terminates, with a value or with a new closure, when there are no more arguments to consume.

For example, if g1 is passed, the code generated for the application expression will determine that this instance of the application is a full application. But if g2, g3, or g4 is passed, then the code will determine that it is a compound application. At this point, compoundCallRTS steps in. For g2, compoundCallRTS will call g2 with 1 and 2, and then will call its result function with 3. The result of the second call is the result of the entire application expression.

For g3, compoundCallRTS will call g3 with 1, will call its result with 2 and 3, and finally, will return the result of the second application as the value of the expression. For g4, compoundCallRTS performs three applications: first it calls g4 with 1, then its result with 2, and finally, the result of the second application with 3. The result of the third application is the result of the original application expression.
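The loop at the core of compoundCallRTS can be sketched in Haskell, with closures modeled as values. All names are illustrative; the real routine manipulates SMT closure records and scratch memory rather than Haskell data.

    data Value   = IntV Int | CloV Closure
    data Closure = Closure { arity :: Int, body :: [Value] -> Value }

    compoundCall :: Closure -> [Value] -> Value
    compoundCall c args
      | n == arity c = body c args                 -- final, full application
      | n >  arity c =                             -- compound: consume a prefix
          let (now, rest) = splitAt (arity c) args
          in case body c now of
               CloV c' -> compoundCall c' rest     -- repeat with the result closure
               IntV _  -> error "type-correct programs never leave arguments over"
      | otherwise =                                -- curried: return a new closure
          CloV (Closure (arity c - n) (\more -> body c (args ++ more)))
      where n = length args

Applied to g4, for instance, this loop peels off one argument three times, exactly the behavior described above.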

Summary

Full applications of top-level functions are easy to compile. The only subtlety is that references instead of values are passed as arguments. But this is only one of the six cases of application expressions. When all the features of λ* are considered, function applications become surprisingly complex, and in the most general cases, require a great deal of code, which we have segregated into the run-time system. In the code sequences above, the cost of higher-order functions is particularly evident: dynamic mechanisms must be invoked because the properties of callees are not known at compile-time. As we mentioned earlier, it is often possible to get rid of these inefficiencies via aggressive inlining at the cost of larger programs.

5.5 Compiling Letrec Expressions

A letrec consists of statements and an in expression. The statements are used to bind identifiers to values or to perform side-effects, and the in expression is used to give a value to the entire block. The language semantics require that the letrec be strict only in the value of the in expression. The key feature of letrecs is that they are the source of parallelism in λ* programs, via parallel execution of statements.

5.5.1 Parallel Statements

In this section, we consider the compilation of parallel and sequential blocks separately.

Though the λ* syntax allows parallel and sequential statements to be intermingled, the front-end of the compiler applies a normalization pass that groups statements into nested parallel and sequential groups. At the outermost level, the statements of letrec blocks execute in parallel. Thus, all letrec blocks follow the pattern

{ S1 ;
  S2 ;
  ...
  Sn
in x }

where the statements S1 through Sn always execute in parallel. Each statement individually may be a simple statement, like a binding, or it can be a group of sequential statements. For example, S2 might expand to (S10 - S11 - S12). The statements S10 through S12 are separated by barriers (-). In turn, each statement can be a simple statement or a group of parallel statements.

The following rule is used to compile a letrec whose outermost statements are to execute in parallel. It relies on TS, the rule to compile statements, explained later. First, the statements and the in expression are compiled into separate code sequences. Then, these are linked together using "glue" code to spawn each sequence as a separate task. For instance, before the code for S1, we insert instructions to spawn S2. At the end of the code for S1, we insert a few instructions to terminate S1 and to request a new thread from the RTS. The code for the in expression is the exception, since it is the last task in the letrec. It does not need to spawn any other tasks, nor does it request work from the RTS. Rather, control flows out of it to the rest of the program.

TE[S1 ; S2 ; ... ; Sn-1 ; Sn in x] =
  { label1, ..., labeln, labelin = genLabel, ..., genLabel, genLabel
    S1Code = TS[S1]
    S2Code = TS[S2]
    ...
    Sn-1Code = TS[Sn-1]
    SnCode = TS[Sn]
    inCode = TE[x]
    finishTaskCode = [ decbar(rB), schedule ]
    spawnS2 = [ label1:, load(rV, @task2), store(rV[1], label2), incbar(rB), spawn(rV) ]
    spawnS3 = [ label2:, load(rV, @task3), store(rV[1], label3), incbar(rB), spawn(rV) ]
    ...
    spawnSn = [ labeln-1:, load(rV, @taskn), store(rV[1], labeln), incbar(rB), spawn(rV) ]
    spawnIn = [ labeln:, load(rV, @taskin), store(rV[1], labelin), incbar(rB), spawn(rV) ]
  in
    spawnS2 ++ S1Code ++ finishTaskCode ++
    spawnS3 ++ S2Code ++ finishTaskCode ++
    ...
    spawnSn ++ Sn-1Code ++ finishTaskCode ++
    spawnIn ++ SnCode ++ finishTaskCode ++
    [ labelin: ] ++ inCode ++ [ decbar(rB) ] }

5.5.2 Simple Statements: Bindings and Side-Effects

There are different versions of TS, one for each kind of statement in the syntax of λ*. Here we present the rules to compile the simplest statements, beginning with the rule to compile bindings:

TS[x = E] = TE[E] ++ [ istore(@x, rV) ]

The rule emits the code for the right-hand side of the binding, to compute its value, and appends an istore instruction to write the resulting value in the cell allocated for the variable, the left-hand side of the binding. We use istore rather than store to record the value of the variable because other computations may be waiting for it already. Nothing in our compilation scheme guarantees that the value of a variable will be produced before it is first requested, so we must use the synchronizing version of the store instruction. Notice that the istore instruction depends on the convention followed by the TE rule that expression values are computed in register rV.

The other kinds of simple statements are the side-effects: iStore and mStore. In these, the first argument is the address of a memory cell and the second is a value to store in the cell. The code generated for iStore and mStore is identical since the real difference in the protocols is embodied in the iFetch and mFetch primitives, which are in turn based on the touch and take SMT instructions. In the rule below, we show iStore as an example. Notice that the code makes certain that both values (the cell pointer and the value to store) have been computed by issuing two touch instructions:

TS[iStore(x, y)] = [ touch(@x) *, touch(@y) *, load(r1, @x), load(r2, @y), istore(r1, r2) ]

In the following example, we use the letrec compilation rule and the two simple-statement rules to generate code for a letrec expression consisting of three statements: two bindings and an iStore. According to the semantics of the language, all the statements and the in expression might run in parallel, limited only by data dependences.

{ x = allocate(1) ;
  y = 151 ;
  iStore(x, y)
in y }

lx:      load(rV, rF[task-y])
         store(rV[1], ly)
         incbar(rB)
         spawn(rV)
         allocate(rV, 1)
         istore(@rF[x], rV)
         decbar(rB)
         schedule
ly:      load(rV, rF[task-istore])
         store(rV[1], listore)
         incbar(rB)
         spawn(rV)
         move(rV, 151)
         istore(@rF[y], rV)
         decbar(rB)
         schedule
listore: load(rV, rF[task-in])
         store(rV[1], lin)
         incbar(rB)
         spawn(rV)
         touch(@rF[x]) *
         touch(@rF[y]) *
         load(r1, @rF[x])
         load(r2, @rF[y])
         istore(r1, r2)
         decbar(rB)
         schedule
lin:     touch(@rF[y]) *
         load(rV, @rF[y])
         decbar(rB)

The label lx marks the code for the first statement in the letrec expression, the binding of variable x. The first thing done by the code is to initialize the task structure for the task that computes the next binding, the binding for y. The initialization consists of setting the pc field of the task structure, at offset 1, to the address of the entry point of the task at label ly. Task structures are mapped statically at compile-time to each independent computation in the program. The compiler determines exactly the number of live variables per task and generates the appropriate allocation code at the beginning of the procedure containing the task, as we saw in Section 5.4. Different tasks can share the same task structure if it can be guaranteed that they do not execute concurrently.

The next instruction in the code for statement x is an incbar. This instruction is used to increment the current barrier counter before spawning the task structure for y onto the global work queue. The increment is used to keep track of the new thread created by the spawn instruction.

After the spawn, the code for computing the value of x begins. The allocate(1) expression is compiled directly to an SMT allocate instruction using TE, and an istore instruction stores the pointer to the allocated space in the cell for x. Finally, the decbar instruction decrements the current barrier counter, and the schedule terminates the current thread and causes a new thread to be executed. The purpose of the decbar instruction is to assert that one less thread exists in the current barrier region. It complements an incbar that was executed outside the code for the letrec when the current thread was spawned.

The code for the second binding at label ly follows a structure similar to the first. The core of the code computes the value of the right-hand side of the binding, the value 151, using a single move instruction according to TE in Section 5.2.1. This is followed by an istore instruction to record the value in the memory cell for y. The four instructions preceding the move are used to spawn the task to perform the last statement in the letrec.

The last two instructions after the istore consist of a decbar and schedule pair. The decbar marks the current task as terminated in the current barrier region and the schedule grabs a new task to execute from the global work queue. The code for the last statement spawns the task to compute the in expression of the letrec, performs the iStore side-effecting statement, and terminates with the usual decbar and schedule instructions.

The code for the in expression is slightly simpler than that for the bindings. First of all, since the in expression computes the final value of the letrec, it does not spawn any other tasks. It anchors the chain of spawns in the code for the letrec statements. The in code consists of a touch and load for the expression y followed by a decbar. As in the case of the statements, the decbar is used to indicate that the task for the in expression, spawned by the last statement, has terminated. Control flows directly from the decbar to the next instruction of the program, outside of the letrec expression, as the letrec is only a piece of a larger expression. Notice, however, that when control does flow out of the code, the value of the letrec will be in register rV as required by our compiling conventions.

The scheme for compiling letrecs exposes parallelism incrementally, not all at once as in our previous work [6]. Statement x spawns y, y spawns the iStore, and the iStore spawns the in expression. If enough processors request work between successive spawns, this scheme allows all statements and the in expression to execute in parallel, with data dependences as the only restriction. In other words, the scheme does not artificially restrict parallelism.

This property is important not just for performance but also for correctness. On the other hand, the scheme also allows the in expression to execute even if all statements suspend on missing data. The letrec block is strict only in the value of the in expression, so this behavior is again needed to implement the correct semantics of the language.
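The shape of this spawn chain can be modeled compactly. The following Haskell sketch is illustrative only: forkIO stands in for the SMT spawn instruction, and barrier counting is omitted. Each statement first exposes the next task and then runs its own code, with the in expression anchoring the chain.

    import Control.Concurrent (forkIO)

    -- Statements and the in expression modeled as IO actions.
    runLetrec :: [IO ()] -> IO () -> IO ()
    runLetrec stmts inExpr = go (stmts ++ [inExpr])
      where
        go []       = return ()
        go [s]      = s                -- the in expression spawns nothing
        go (s : ss) = do
          _ <- forkIO (go ss)          -- expose the next statement as a task
          s                            -- then run this statement's own code

Because each task exposes its successor before running its own code, the in expression is reached even if every statement suspends, mirroring the behavior described above.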

It may seem odd to include barrier instructions like incbar and decbar in the code for parallel statements. After all, barriers are supposed to be used to sequentialize statements. The scope of barriers, however, is dynamic, so even though the letrec block in isolation contains no barriers, it may still execute within the context of a barrier. In our particular implementation, a global barrier, initialized by the run-time system, encloses the execution of the entire program. As a result, all statements run in the context of some barrier. The global barrier is used by the run-time system to detect termination of the entire program since in a non-strict program, simply returning a result does not imply that computation has ceased. Remember that a pH program may perfectly well return an "answer" while it is still performing side-effecting operations like output to the screen.

5.5.3 Sequential Statements

The front-end of the compiler arranges letrec expressions so that parallel and sequential statement groups alternate, with a parallel group at the outermost level of every letrec.

The previous rule showed how to compile a parallel group of statements, which, in general, may be followed by an in expression. In the following rule, we show how to compile a group of sequential statements. A sequential group is never followed by an in expression since it is always enclosed within a parallel group.

The statements in a sequential group are always separated by barriers (-). In the rule below, each barrier has been given a unique local name (b1, b2, ...) by the compiler since each is implemented with a separate barrier structure. The local name represents the slot in the activation frame that holds the pointer to the barrier structure. These are the same barrier structures initialized at the beginning of the procedure after the activation frame is allocated (Section 5.4).

Like the rule for compiling parallel statement groups, the rule for sequential groups generates the code for each statement separately and then links the code sequences together with "glue" instructions for synchronization.

TS[S1 -b1- S2 -b2- ... -bn-2- Sn-1 -bn-1- Sn] =
  { S1Code = TS[S1]
    S2Code = TS[S2]
    ...
    Sn-1Code = TS[Sn-1]
    SnCode = TS[Sn]
    finishStatement = [ decbar(rB), touchbar(rB) * ]
  in
    [ initbar(@b1, 1) ] ++ S1Code ++ finishStatement ++
    [ initbar(@b2, 1) ] ++ S2Code ++ finishStatement ++
    ...
    [ initbar(@bn-1, 1) ] ++ Sn-1Code ++ finishStatement ++
    SnCode }

The synchronization code for each statement is divided into two parts. The first part is a single initbar instruction at the beginning of each statement. This instruction initializes a new barrier object on entry to the statement. The new barrier becomes the current barrier object in register rB and is used to keep track of all activities created by the statement code.

The second part of the synchronization code consists of a pair of instructions after the code for the individual statements (except the last). These two instructions are a decbar and a touchbar of the current barrier. The decbar is used to decrement the initialization count of the barrier. If no other tasks are outstanding when the decbar executes, then the subsequent touchbar will also execute without suspending. However, if additional tasks were created by the code for the statement and some of these have not completed, then the decbar will not decrement the barrier counter to zero and the touchbar will suspend. When the last decbar executes in one of the pending tasks, the code following the touchbar will resume executing. This code corresponds to the next statement in the sequence. Thus, the barrier prevents the code for the next statement from executing before all tasks created in the previous statement have terminated. The last statement in a sequential group does not create its own barrier. Rather, it forms the post-region of the last barrier. Control flows out of this last statement to the rest of the letrec expression surrounding the sequential statement group.
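The counting protocol itself can be modeled with Haskell STM. The definitions below capture only the counter semantics of initbar, incbar, decbar, and touchbar; the real barrier structures live in the activation frame and interact with the SMT scheduler.

    import Control.Concurrent.STM

    newtype Barrier = Barrier (TVar Int)

    -- Create a barrier with an initial count (1 in the rule above).
    initbar :: Int -> STM Barrier
    initbar n = Barrier <$> newTVar n

    incbar, decbar :: Barrier -> STM ()
    incbar (Barrier v) = modifyTVar' v (+ 1)         -- one more outstanding task
    decbar (Barrier v) = modifyTVar' v (subtract 1)  -- one task has terminated

    -- Suspend (retry) until every task in the barrier region has terminated.
    touchbar :: Barrier -> STM ()
    touchbar (Barrier v) = do
      n <- readTVar v
      if n == 0 then return () else retry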

In the following example, we show the code generated for a letrec with both parallel and sequential statement groups.

{ z = 1 ;
  ( x = allocate(z) -
    y = 151 -
    iStore(x, y) )
in y }

lz:   load(rV, rF[task-seq])
      store(rV[1], lseq)
      incbar(rB)
      spawn(rV)
      move(rV, 1)
      istore(@rF[z], rV)
      decbar(rB)
      schedule
lseq: load(rV, rF[task-in])
      store(rV[1], lin)
      incbar(rB)
      spawn(rV)
      initbar(@rF[barrier1], 1)
      touch(@rF[z]) *
      load(r1, @rF[z])
      allocate(rV, r1)
      istore(@rF[x], rV)
      decbar(rB)
      touchbar(rB) *
      initbar(@rF[barrier2], 1)
      move(rV, 151)
      istore(@rF[y], rV)
      decbar(rB)
      touchbar(rB) *
      touch(@rF[x]) *
      touch(@rF[y]) *
      load(r1, @rF[x])
      load(r2, @rF[y])
      istore(r1, r2)
      decbar(rB)
      schedule
lin:  touch(@rF[y]) *
      load(rV, @rF[y])
      decbar(rB)

According to the semantics of the language, the statement z, the entire sequential statement group, and the in expression can run in parallel. The code for z spawns the sequential statements as a parallel task. In turn, the first thing done in the sequential statements is to spawn the in expression. Within the sequential statement group, execution flows serially except at touchbar instructions. These instructions prevent execution from proceeding until all activities in the pre-region (the code before the touchbar) have terminated. At the end of the sequential group, we find the usual decbar and schedule pair. These mark the end of the task comprised by the entire sequential statement group.

The barriers only guarantee that x, y, and the iStore will execute in order. Outside of the sequential statement group, the only restrictions on execution order arise from data dependences. Since the in expression depends on y, we can be certain the values of z, x, and y will be computed before the value of the in expression is computed. The execution of the iStore is less certain. It may execute before or after the value of the in expression is computed.

5.5.4 Summary

A letrec expression consists of a parallel statement group and an in expression. Code for each statement is generated separately, and statement code sequences are linked together with instructions to synchronize their execution. In a parallel statement group, the synchronization code performs two functions. It spawns tasks onto the global work queue, and it keeps track of the termination of tasks. To avoid creating excessive work, parallel work is made available incrementally. The letrec code is structured so that execution will always reach the in expression even if all statements suspend. Letrecs are strict only in the value of the in expression, and the SMT code must preserve these semantics.

A sequential statement group may be nested within a parallel statement group in a letrec. Sequential statements are also compiled individually and stitched together with synchronization instructions. In this case, the synchronization code makes use of barrier structures to keep track of termination at the statement granularity. All the tasks created by the code for one statement must terminate before the next statement begins to execute.

5.6 Compiling Programs

In this section we present the rule TP for compiling whole programs. This rule drives the SMT code generation process and calls TE as a subroutine. First, it is important to discuss the form of a pH program at the time it reaches the compiler back-end. A complete pH program consists of a number of independently compiled modules, one of which must be named Main and must contain a function main. The function main is the entry-point into the program. Each module is specified in a single file and consists of a set of top-level definitions. These definitions are similar to the binding statements in a letrec expression, except that they occur outside the scope of any expression. The left-hand side of a definition is a top-level identifier, and the right-hand side is always a lambda expression.

The TP rule produces a single long sequence of instructions corresponding to the code for all top-level definitions. This single sequence is constructed by concatenating the subsequences of code for each definition. Before the code for each individual definition, the TP rule adds the entry-point label for the function. The name of this label is derived from the name of the function's top-level identifier, and the label is used to generate the code for application expressions in Section 5.4.2.

In addition, the TP rule generates a static closure data structure for each top-level function using the genClosure routine. The closure is labeled with the top-level identifier bound to the λ-expression used to generate the closure. Remember from Section 5.1.4 that a closure consists of at least three slots: the arity of the closure, the entry point to the code, and the number of arguments curried in the closure (zero for top-level functions). The genClosure routine takes two arguments to generate the data structure: the entry point to the code and the lambda expression represented by the closure. From the lambda expression, genClosure can determine the arity value to place in the first slot of the closure. The genClosure routine returns a sequence of SMT pseudo-instructions to describe the structure of the closure. Pseudo-instructions are common in most low-level machine languages and are used to specify compilation directives rather than code. The SMT pseudo-instructions are described informally below.

TP[ x1 = λ1 ... E1
    x2 = λ2 ... E2
    ...
    xn = λn ... En ] =
  { λ1Code    = [ x1-entry-point: ] ++ TE[λ1 ...]
    λ1Closure = [ x1: ] ++ genClosure(x1-entry-point, λ1 ...)
    λ2Code    = [ x2-entry-point: ] ++ TE[λ2 ...]
    λ2Closure = [ x2: ] ++ genClosure(x2-entry-point, λ2 ...)
    ...
    λnCode    = [ xn-entry-point: ] ++ TE[λn ...]
    λnClosure = [ xn: ] ++ genClosure(xn-entry-point, λn ...)
  in
    λ1Code ++ λ2Code ++ ... ++ λnCode ++
    λ1Closure ++ λ2Closure ++ ... ++ λnClosure }

In the following example, the TP rule is used to compile a pH module consisting of the single function identity.

identity = λ a . a

identity-entry-point:
    allocate(r9, 5)
    store(r9[0], rF)
    move(rF, r9)
    store(rF[1], rL)
    store(rF[2], rT)
    store(rF[a], r1)
    allocate(rT, 3)
    store(rF[task1], rT)
    touch(@rF[a]) *
    load(rV, @rF[a])
    load(rT, rF[2])
    load(rL, rF[1])
    load(rF, rF[0])
    jump(Always, rL)

    .datablock 4
identity:
    .datapointer identity-closure
identity-closure:
    .integer 1
    .codepointer identity-entry-point
    .integer 0

The code following the label identity-entry-point is generated by the TE function applied to a single-argument λ-abstraction. It contains the instructions to allocate and initialize a frame, compute the value of a, and return from the function. The frame for identity contains five things: the caller's frame address, the return address, the caller's task structure, a reference to the argument a, and a task structure for the body of the lambda expression.
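For clarity, the five slots of this frame can be written out as a record. The following Haskell rendering is purely illustrative: real frames are flat five-slot heap blocks, and the type names are stand-ins for machine words.

    type Addr = Int  -- machine addresses and code labels
    type Ref  = Int  -- a reference to a (possibly empty) value cell

    data IdentityFrame = IdentityFrame
      { callerFrame :: Addr  -- slot 0: caller's frame pointer (saved rF)
      , returnAddr  :: Addr  -- slot 1: return address (saved rL)
      , callerTask  :: Addr  -- slot 2: caller's task structure (saved rT)
      , argA        :: Ref   -- slot 3: reference to the argument a
      , bodyTask    :: Addr  -- slot 4: task structure for the body (task1)
      }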

The data block following it is generated by the genClosure compiler routine. The .datablock pseudo-instruction defines the beginning of a four-cell static data structure. The first cell in the structure is a pointer to the closure for the identity function and is used in applications of identity. The next cell is the first cell of the closure. It contains the integer 1, representing the arity of the closure, since identity has one argument. The second cell of the closure contains the entry point label of identity, by convention the label identity-entry-point. This label is resolved into an address by the linker when the SMT code is transformed into an executable program. The last cell contains zero since the top-level closure for identity contains no curried arguments.

5.7 Summary

In this chapter, we have presented the translation of λ* programs to SMT code. We began by specifying the execution environment for the SMT code, including the structure of procedure activation frames, closures, user data structures, and the use of references to implement non-strictness. Then, we described the translation of each expression and statement in λ* using the functions TE and TS. The whole syntax-directed translation process is driven by the rule to compile programs, TP.

The translation process involves a major change in the representation of programs, from a high-level, implicitly parallel language to a low-level, explicitly parallel machine instruction set. The most involved expressions to translate are λ-abstractions, function applications, and letrec blocks. Both λ-abstractions and function calls are complex for two reasons: the parameter-passing convention must accommodate higher-order functions and currying, and the activation frames must be initialized properly to support synchronization of parallel activities. In turn, letrec blocks are the source of parallelism in λ* programs.

The compiler must generate instructions to manage parallelism and to track termination of concurrent tasks, all while maintaining the correct semantics of the source language.

Chapter 6

Better Threaded Code

An SMT thread is a sequence of instructions that is enabled by a spawn instruction and that terminates when it encounters a schedule instruction. It is the basic schedulable entity in the SMT programming model. In the previous chapter, the compilation of letrec blocks embodied a trivial algorithm for dividing a λ* program into threads: every statement in a parallel block was spawned separately. Though always correct, this strategy produces too many threads and incurs too much scheduling overhead. Often, several statements could be computed together, in the same thread, without affecting the non-strictness or termination properties of the program.

In this chapter we examine how the compiler creates larger threads for more efficient programs. The key to larger threads is partitioning analysis, whose history we discussed in Section 3.3. Using the results of partitioning analysis, the compiler performs a threading transformation. Threading is a source-to-source transformation in the λ* representation of a program. Essentially it involves replacing parallel statement conjunctions (;) with threaded statement conjunctions (~) in letrecs. The resulting threaded program is then compiled to SMT code that is significantly faster than that resulting from the non-threaded version of the program.

In this chapter, we begin by comparing suspensive and non-suspensive threads. Then, we discuss the last stage of the threaded compilation process, using fib as a running example. We assume that a threaded λ* program has been given to the compiler back-end, and we discuss the changes that have to be made to generate better SMT code. We emphasize code generation because the back-end is essentially independent of the partitioning analysis and threading transformation. Better partitioners could be implemented with little or no change to a back-end that knows how to generate SMT code for threaded letrecs.

After discussing the code generator, we take up the subject of partitioning. We describe the information produced by partitioning analysis, and we show how the threading transformation uses it, along with additional heuristics, to produce a threaded λ* program. Interestingly, the threading transformation can be too successful and can remove all parallelism from a program. We show what can be done in this situation. We conclude with a discussion of how scheduling has affected the strategy we use to spawn parallel threads.

6.1 Suspensive versus Non-Suspensive Threads

In existing compilers for Id and pH, like the Berkeley Id compiler [16, 20, 55] and pHluid [18], a thread is a non-suspensive sequence of instructions with strictly linear control-flow, i.e., no jumps or loops. The thread is non-suspensive because all the local (frame-based) values that it requires are guaranteed to be available before the thread begins to execute. Non-suspensive threads are organized so that they synchronize "up-front" and will not have to suspend mid-stream. To access I-Structures and M-Structures, they use split-phase memory transactions. The first part of a split-phase transaction is a non-blocking request for data from memory. The second part of the transaction is a reply from memory with data that activates the execution of a new thread. Since the request and reply in a split-phase memory transaction can be separated by an arbitrary amount of time (e.g., when an I-Structure is empty), a thread must terminate, and yield control to the scheduler, after issuing the first part of the split-phase transaction. Otherwise, the thread would violate the non-suspensive assumption. Even if the value happened to be available at run-time, the thread must still terminate because the synchronization status of the memory location cannot be known at compile time. In other words, non-suspensive threads imply a pessimistic synchronization strategy.

Threads in the pHc compiler are more substantial sequences of code than in Berkeley Id or pHluid. According to the definition, an SMT thread can encompass rich control-flow, including loops and conditionals. In addition, an SMT thread is suspensive: synchronization is done "just-in-time". If a local variable or the value of an I/M-Structure cell is not available, the thread suspends until the value is computed. But if a value is available, the thread continues on. Thus, I-Structure and M-Structure operations do not force a thread to terminate. In this way, a suspensive thread is optimistic about synchronization, and the code can be structured so that the common case, where a value is available before it is used, can run quickly.
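The just-in-time discipline is easy to model with Haskell MVars standing in for presence-tagged cells. This is only an analogy: the SMT machine implements the same behavior with a presence field on ordinary memory words and explicit deferred lists.

    import Control.Concurrent.MVar

    type Cell a = MVar a

    -- Suspensive touch: return immediately if the cell is full, suspend the
    -- thread if it is empty. No split-phase protocol is needed.
    touch :: Cell a -> IO a
    touch = readMVar

    -- Writing a value wakes any readers suspended on the cell. (A second
    -- write to an I-Structure cell is an error; that check is omitted here.)
    istore :: Cell a -> a -> IO ()
    istore = putMVar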

The code generated by the Berkeley Id and pHluid compilers is not always pessimistic. In certain restricted situations, it can also make the optimistic assumptions that are the default in pHc. In Berkeley Id, the compiler often inlines the inlets that receive the split-phase response from memory with the code that issues the split-phase request when the value is known to be available. However, this optimization is applied only when the compiler is emitting single-processor code. In the case of pHluid, which is targeted at distributed memory machines, the response handler for the split-phase transaction is executed immediately after the request when the following conditions are true for a synchronizing load instruction: the instruction following the load is a halt (the last instruction in a pHluid thread), the source address is in memory local to the current processor, and the value is available.

The results of Schauser and Goldstein [57] seem to indicate that the optimistic synchronization policy is good for many Id (pH) programs. They studied non-strictness in Id programs, and found that of 45 programs, only eleven required non-strictness, and of those, five were designed specifically to be non-strict. If a program does not require non-strictness to execute properly, then in practice, most values will be computed before they are used. Our results in Section 8.4 bear this out.

Compilers like Berkeley Id and pHluid use non-suspensive threads because their internal program representations are variations on the dataflow program graph (DFPG). In a DFPG, a graph node (operator) executes atomically after all its inputs are available. It is natural to implement such an operator as a sequence of instructions that synchronizes up-front and then runs to completion in a bounded amount of time.

Non-suspensive threads and split-phase memory operations are also a natural fit for these compilers since they originally targeted distributed memory machines. In these machines, memory latencies can be enormous, so it is not feasible to check the presence field of a memory cell before every access, as we do in the SMT touch instruction. Split-phase transactions allow the overlap of communication and computation, so a processor can do useful work while fetching data from remote memory. In an SMP, however, the situation is different. Overlapping computation and communication is not as useful since the time to access a single cell of memory is roughly comparable to the time needed to switch to a new thread.

We chose to use suspensive threads in pHc for two reasons. First, it seems to be easier to generate long suspensive threads than non-suspensive threads since suspensive threads do not have to terminate on access to an I-Structure / M-Structure. Long threads run more efficiently than short threads, as long as suspensions are rare, because they can make more effective use of processor state. Long threads can accumulate more state in registers and can amortize scheduling overhead over more instructions. The second reason for using suspensive threads is that they can yield better code when programs run serially in the common case. We wanted to be able to put structures like loops and procedure calls into a single thread whenever possible.

Suspensive threads, however, don't come for free. Statically, a suspensive thread must include conditional tests to check the status of presence fields and code to save state should a suspension occur. These operations are embodied in suspensive SMT instructions like touch. Dynamically, the most significant cost of suspensive threads is the execution of these conditional tests, not the code to save state. The common case is that threads do not suspend, since programs are rarely non-strict, so branch-prediction hardware on modern processors should make short work of the extra conditionals.

6.2 SMT Code from Threaded Programs

In this section we describe how to generate SMT code from programs that have been threaded already. We begin by introducing non-threaded and threaded versions of fib, a program for computing Fibonacci numbers. Then we see how to modify the current compilation rule for letrecs to generate code that is more efficient because it creates fewer threads.

6.2.1 A Non-Threaded fib

A non-threaded version of fib looks like this in λ*:

fib = λ x .
  { p = x >= 2 ;
    t = Prj0(p) ;
    r = Case(t,
             x,
             { x1 = x - 1 ;
               f1 = fib x1 ;
               x2 = x - 2 ;
               f2 = fib x2 ;
               r' = f1 + f2
             in r' } )
  in r }

Though the algorithm is well known, it is worth explaining the λ* code. First, the code tests whether the argument is greater than or equal to 2. The result is an object of type Bool, a type consisting of the two disjuncts False and True. Next, the primitive function Prj0 is used to extract the tag of the object. As discussed in Section 5.3, tags are an implementation technique that uses integers to distinguish algebraic disjuncts. The tag for the boolean False is 0 and for True is 1.

The tag is used in the Case expression as an index into the alternatives. If x is less than two, in which case the tag will be 0, the first alternative will be executed. This expression is simply the base case for the standard recursive Fibonacci algorithm. If x is greater than or equal to 2, in which case the tag is 1, then the code enters the recursive case of the Fibonacci algorithm. The code makes two recursive calls to fib, with parameters x-1 and x-2, sums the results, and returns that as the result of the function call.
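For reference, here is the same algorithm in ordinary Haskell; the λ* version above is this definition with the conditional made explicit as a tag projection and a Case dispatch.

    fib :: Int -> Int
    fib x = if x >= 2
              then fib (x - 1) + fib (x - 2)  -- recursive case (tag 1)
              else x                          -- base case (tag 0)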

How many threads are created by the current letrec compilation strategy for this version of fib? In the outer letrec block, the compiler will spawn the bindings for t, r, and in r. In the inner letrec block inside the True side of the Case expression, the compiler will spawn the threads for f1, x2, f2, r', and in r'. So in total, eight threads are created for every invocation of fib beyond the base case. Most of these threads, except for the recursive calls to fib, contain only a tiny amount of work.

6.2.2 A Threaded fib

Now suppose the back-end is given the following threaded version of fib:

fib = λ x .
  { p = x >= 2 ~
    t = Prj0(p) ~
    r = Case(t,
             x,
             { x1 = x - 1 ~
               x2 = x - 2 ~
               f1 = fib x1 ~
               f2 = fib x2 ~
               r' = f1 + f2 ~
             in r' } ) ~
  in r }

In this threaded version of fib, the compiler, after careful analysis, has seen fit to replace every parallel statement conjunction (;) with a threaded statement conjunction (~). In the process, it has also reorganized the order of some statements, but this is not particularly important.

As we mentioned in Section 2.8, the '~' notation is used to assert that two statements do not need to execute in parallel and that the code for the second statement can always follow the code for the first, regardless of context. In other words, the two statements can execute as a single thread. This threaded version of fib agrees with our intuition that the whole computation should be able to fit into a single thread while maintaining the same termination behavior as the original non-threaded version.

For convenience, the compiler has also inserted '~' between the last statement in a letrec and the in expression. This is an extension of the syntax of λ*, and it asserts the threading relationship between these two parts of the expression. It is important to note that any '~' can be ignored and treated as a ';' without affecting the semantics of the program, but the opposite is not true. A ';' cannot be replaced by '~' arbitrarily.

6.2.3 Compiling Threaded Letrec Expressions

The '~' notation allows the compiler to make strong assertions about control-flow and synchronization, which in turn lead to much better code generation. The compiler knows that statements in the same thread do not have to execute independently. The following simple compilation rule handles groups of threaded statements. The rule compiles the code for each statement in the thread separately, like the other statement compilation rules. But now, it does not need to add any "glue" code to link the sequences together! The '~' asserts that control-flow can pass directly from one sequence to the next, so they are simply concatenated into one.

TS[S1 ~ S2 ~ ... ~ Sn] =
  { S1Code = TS[S1]
    S2Code = TS[S2]
    ...
    SnCode = TS[Sn]
  in
    S1Code ++ S2Code ++ ... ++ SnCode }

The nesting of parallel and sequential regions remains the same as before. Letrec blocks are assumed to be parallel at the outermost level, but now, their component statements can be sequences of statements connected by threading conjunctions or by barriers.

Contrasting the rule for compiling threaded statements with the rule for compiling sequential statements (Section 5.5.3) makes clear the distinction between 'S1 - S2' and 'S1 ~ S2'. The synchronization guarantee offered by a barrier is absolute: even if control-flow reaches S2, execution cannot continue until all previous activities created by S1 have completed. It is the responsibility of the programmer who has inserted the barrier to ensure that the program will not deadlock given this strong synchronization. The '~' statement conjunction, which is inserted by the compiler, not the programmer, provides a different guarantee. It assures that no value consumed by any task in S1 is produced by any task in S2. In other words, there is no chain of dependences from computations in S2 to computations in S1. Thus S2 can follow S1 without the possibility of deadlock, in all execution contexts. By itself, the '~' guarantees nothing about the termination of tasks created by S1 when control reaches S2.

There is one minor modification that needs to be made to the compilation scheme for letrec expressions. When a threaded program indicates that the in expression of a letrec can come in the same thread as the last statement, we omit spawning the in expression and simply concatenate it to the code for the last statement. We show this new rule below, and use it only when there is a '~' between the last statement and the in expression.

TE[S1 ; S2 ; ... ; Sn-1 ; Sn ~ in x] =
  { label1, ..., labeln = genLabel, ..., genLabel
    S1Code = TS[S1]
    S2Code = TS[S2]
    ...
    Sn-1Code = TS[Sn-1]
    SnCode = TS[Sn]
    inCode = TE[x]
    finishTaskCode = [ decbar(rB), schedule ]
    spawnS2 = [ label1:, load(rV, @task2), store(rV[1], label2), incbar(rB), spawn(rV) ]
    spawnS3 = [ label2:, load(rV, @task3), store(rV[1], label3), incbar(rB), spawn(rV) ]
    ...
    spawnSn = [ labeln-1:, load(rV, @taskn), store(rV[1], labeln), incbar(rB), spawn(rV) ]
    lastStmt = [ labeln: ]
  in
    spawnS2 ++ S1Code ++ finishTaskCode ++
    spawnS3 ++ S2Code ++ finishTaskCode ++
    ...
    spawnSn ++ Sn-1Code ++ finishTaskCode ++
    lastStmt ++ SnCode ++ inCode ++ [ decbar(rB) ] }

6.2.4 Additional Changes for Threaded Compilation

Due to the '~' notation, it is easy to determine the order in which statements execute at compile-time, and consequently, it is easy to determine what local values have been computed by the time a particular expression executes. The compiler gathers this available-identifiers information as annotations in a pass before SMT code generation. The annotation for each expression contains the set of identifiers whose values are known to be available when control-flow reaches the expression. Here is the annotated, threaded version of fib:

fib = λ x .                        Available Identifiers Annotations
  { p = x >= 2 ~                   {}
    t = Prj0(p) ~                  {p,x}
    r = Case(t,                    {p,t,x}
             x,                    {p,t,x}
             { x1 = x - 1 ~        {p,t,x}
               x2 = x - 2 ~        {p,t,x,x1}
               f1 = fib x1 ~       {p,t,x,x1,x2}
               f2 = fib x2 ~       {f1,p,t,x,x1,x2}
               r' = f1 + f2 ~      {f1,f2,p,t,x,x1,x2}
             in r' } ) ~           {f1,f2,p,r',t,x,x1,x2}
  in r }                           {p,r,t,x}

Before generating the code for an expression, the compiler checks the contents of the corresponding annotation and does not emit a touch instruction if the value of the identifier is available. For example, in the expression to compute r', the identifiers f1 and f2 appear in the annotation, so the compiler will generate the following code for the primitive application:

f1 + f2

    load(r1, @rF[f1])
    load(r2, @rF[f2])
    plus(rV, r1, r2)

The structure of the code guarantees that f1 and f2 are always computed before r'. The rules to compile identifiers, primitive functions, allocate, iFetch, mFetch, projections, Case expressions, and applications of higher-order functions change slightly to exploit these annotations.
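The annotation check itself is a one-line test. As a sketch (the instruction constructors and names below are illustrative, not the compiler's):

    import qualified Data.Set as Set

    data Instr = Touch String | Load String  -- simplified SMT instructions

    -- Emit a touch only when the identifier is not known to be available.
    emitOperand :: Set.Set String -> String -> [Instr]
    emitOperand avail x
      | x `Set.member` avail = [Load x]
      | otherwise            = [Touch x, Load x]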

6.2.5 SMT Code for Threaded fib

Figure 6-1 shows the dramatic improvements in code quality that result when the compiler takes advantage of threading. The left-hand column of the figure shows the code that would be generated for the threaded fib by applying the compilation strategy from Chapter 5 and by treating '~' as ';'. For reference, the code begins with the label fib-entry-point to mark the beginning of the function, as in Section 5.4. To fit on one page, we omit the code for frame allocation/initialization, the definitions for p and t in the outer letrec, and the beginning of the conditional, and we resume with the code for computing x1 at label start-x1. From here on, every label marks the code for a different statement or for the in expression of the inner letrec block.

Original Code:

fib-entry-point:
start-x1:
    load(r1, rF[task-x2])
    store(r1[pc], start-x2)
    incbar(rB)
    spawn(r1)
    touch(@rF[x]) *
    load(r1, @rF[x])
    sub(rV, r1, 1)
    istore(@rF[x1], rV)
    decbar(rB)
    schedule
start-x2:
    load(r1, rF[task-f1])
    store(r1[pc], start-f1)
    incbar(rB)
    spawn(r1)
    touch(@rF[x]) *
    load(r1, @rF[x])
    sub(rV, r1, 2)
    istore(@rF[x2], rV)
    decbar(rB)
    schedule
start-f1:
    load(r1, rF[task-f2])
    store(r1[pc], start-f2)
    incbar(rB)
    spawn(r1)
    load(rV, fib)
    load(rV, rV)
    load(r1, rF[x1])
    jump(Always, fib-entry-point, rL)
    istore(@rF[f1], rV)
    decbar(rB)
    schedule
start-f2:
    load(r1, rF[task-r'])
    store(r1[pc], start-r')
    incbar(rB)
    spawn(r1)
    load(rV, fib)
    load(rV, rV)
    load(r1, rF[x2])
    jump(Always, fib-entry-point, rL)
    istore(@rF[f2], rV)
    decbar(rB)
    schedule
start-r':
    load(r1, rF[task-in])
    store(r1[pc], start-in)
    incbar(rB)
    spawn(r1)
    touch(@rF[f1]) *
    touch(@rF[f2]) *
    load(r1, @rF[f1])
    load(r2, @rF[f2])
    plus(rV, r1, r2)
    istore(@rF[r'], rV)
    decbar(rB)
    schedule
start-in:
    touch(@rF[r']) *
    load(rV, @rF[r'])

Threaded Code:

fib-entry-point:
start-x1:
    sub(r-x1, r-x, 1)
    sub(r-x2, r-x, 2)
    load(rV, fib)
    load(rV, rV)
    move(r1, r-x1)
    jump(Always, fib-entry-point, rL)
    move(r-f1, rV)
    load(rV, fib)
    load(rV, rV)
    move(r1, r-x2)
    jump(Always, fib-entry-point, rL)
    move(r-f2, rV)
    plus(rV, r-f1, r-f2)

Figure 6-1: Original vs. Threaded Code for the Inner Letrec of fib

The right-hand column shows the code generated for the same letrec expression using threading information. The threaded code is much shorter, uses less memory, and executes much faster. The improvements to the code are achieved via the elimination of synchronization "glue" code in letrecs, the elimination of touch instructions by tracking available identifiers, and additional threading optimizations discussed in the next section. The code is also restructured to use symbolic registers, such as r-x1 for the variable x1.

6.3 Optimizations Enabled by Threading

The threading transformation reveals opportunities for optimizations in longer sequences of instructions. In particular, because execution proceeds sequentially within a thread, the compiler can arrange to share the results of computations much more effectively. For instance, values can be computed in registers and reused throughout a thread without needing to be reloaded from the activation frame.

6.3.1 Codeblocks: A Control-Flow Abstraction

The optimizations discussed in this section rely on auxiliary data structures to represent information about the control-flow in a thread. In a traditional compiler for imperative languages, this data structure takes the form of a control-flow graph of basic-blocks on which dataflow analyses are performed [4, Ch. 10]. In the pHc compiler, we follow a similar approach, though our task is simpler for several reasons: a pH variable is defined only once, the language allows only structured control constructs, and loops (at present) are represented as tail-recursive function calls. This last property of the current implementation means that our control-flow graphs are acyclic.

The control-flow graphs in pHc are hierarchical and are based on the concept of a codeblock, a piece of code (possibly consisting of several traditional basic-blocks) with a single entry point and a single exit point. The entry point and the exit point are not necessarily part of a single thread of control since new threads can be spawned within a codeblock. It is guaranteed, however, that control will eventually flow to the exit point so that codeblocks can be connected together. There are five kinds of codeblocks, as illustrated in Figure 6-2:

Figure 6-2: Codeblocks: A Control-Flow Abstraction

Simple: this most basic codeblock consists of a sequence of SMT instructions, including suspensive instructions like touch. Due to the current code generation strategy, all jump instructions are always forward jumps. Simple codeblocks are the only ones that "contain" instructions. As we'll see, the four other kinds of codeblocks contain nested codeblocks.

Conditional: this kind of codeblock is used to abstract the code generated for Case expressions in λ*. It consists of a codeblock for the predicate expression, codeblocks for the alternatives, and an exit codeblock to merge the control-flow from the alternatives. Control flows into the predicate codeblock and out of the exit codeblock, but possibly not as part of a single thread. The predicate and the alternatives, recursively, can be any kind of codeblock, so in general, they can create new threads.

Parallel: the sub-blocks of a parallel codeblock represent computations that can execute in parallel. Control flows into the first sub-block, where the next sub-block is spawned. Control enters the spawned codeblocks via the scheduler. Control exits the Parallel codeblock from the last codeblock.

Sequential: the sub-blocks in a sequential codeblock represent computations separated by barriers. Control flows sequentially from one sub-block to the next, though possibly not as part of a single thread. A touchbar is assumed to execute between sub-blocks.

Threaded: a threaded codeblock is similar to a sequential codeblock in that control flows sequentially from one sub-block to the next, but there are no barrier instructions executed between sub-blocks. Threaded blocks result from partitioning.

Figure 6-3: Codeblock for Threaded fib

The codeblock for the threaded version of fib, from Figure 6-1, is shown in Figure 6-3. The outermost codeblock is a Threaded codeblock containing several Simple codeblocks and a Conditional codeblock. One of the alternatives of the conditional (the base case) is a Simple codeblock while the other (containing the recursive calls to fib) is another Threaded codeblock.

definedVars :: Codeblock -> VarSet -> VarSet
definedVars cb d =
  case cb of
    Simple instructions ->
      for i <- instructions
        i.defvars <- d
        d <- i.destinations ∪ d
      finally d
    Conditional predicate alternatives ->
      let d'     <- definedVars predicate d
          (a:as) <- alternatives
          d''    <- definedVars a d'
      in for cb' <- as
           d'' <- (definedVars cb' d') ∩ d''
         finally d''
    Parallel subblocks ->
      let (b:bs) <- subblocks
          d'     <- definedVars b d
      in for cb' <- bs
           d' <- definedVars cb' ∅
         finally d'
    (Sequential || Threaded) subblocks ->
      for cb' <- subblocks
        d <- definedVars cb' d
      finally d

Figure 6-4: Algorithm to Find Defined Variables in Threads

6.3.2 Defined-Variables Analysis

Using the codeblock structure, it is easy to perform control-flow analysis on SMT code.

An important forward analysis, one that proceeds in the same direction as control-flow, annotates each instruction with the names of the variables that have been defined in registers in the current thread. With this information, the compiler can remove instructions that load an already-defined variable from memory. We assume the SMT code has been transformed so that each identifier is assigned its own symbolic register. This transformation is common in many intermediate languages, and allows the compiler to use symbolic registers in SMT much in the same way identifiers are used in λ*.

The algorithm to compute the defined variables in a codeblock is presented in Figure 6-4 as the procedure definedVars. The parameters to the procedure are a codeblock (cb) and a set of defined variables (d). The procedure returns a new set of defined variables, and as a side-effect, annotates all the instructions in the codeblock with the appropriate set of defined variables. The defined-variables analysis is performed on each SMT procedure separately, so the initial inputs to definedVars are a codeblock for an entire pH procedure and the null set (∅) since by convention there are no defined variables on entry to a procedure.

The algorithm performs a case analysis on the kind of codeblock given as input. Though there are five kinds of codeblocks, only four cases are required because Sequential and Threaded codeblocks are considered together. For Simple codeblocks, the only ones that actually contain SMT instructions, the procedure iterates through the list of instructions in order. At each instruction, it does two things. First, it sets the defvars field of the instruction to the value of the current defined-variables set (d). Second, it updates this set with the variables defined by the instruction. These are obtained from the destinations field of the instruction, which contains the names of all symbolic registers changed by the instruction.

A Conditional codeblock consists of a predicate and a list of codeblocks for the alternatives. The procedure calls itself recursively to compute the defined variables of the predicate, using the original defined-variables set (d) as a parameter. The resulting defined-variables set (d') is used to process the alternatives. To understand the rest of the Conditional analysis, it is important to keep two facts in mind: first, the defined variables on entry to each alternative are the defined variables on exit from the predicate codeblock; and second, the defined variables on exit from the Conditional are given by the intersection of the defined variables on exit from each alternative. At compile time, it is unknown which alternative will be taken, so the analysis must be conservative in this way. This intersection is computed incrementally in variable d''.

A Parallel codeblock contains a list of sub-blocks. According to our compilation strategy, the defined variables on entry into the first of these sub-blocks are exactly those defined on entry into the enclosing Parallel codeblock. The set of defined variables on entry into the remaining sub-blocks, however, is the empty set, because the remaining codeblocks might execute in parallel with the first, and defined variables cannot cross thread boundaries at present. The defined variables on exit from the Parallel codeblock are exactly those defined on exit from the last sub-block. This last sub-block generally contains the code for the in expression of a letrec.

The Sequential and Threaded codeblocks are handled identically. These codeblocks contain a list of sub-blocks, and control flows from sub-block to sub-block. Therefore, the set of defined variables on entry to a given sub-block is exactly the set of variables defined on exit from the previous sub-block.

The defined-variables analysis assumes that a register defined early in a thread will be available to all instructions that use it within the same thread. After the elimination of redundant load instructions, however, the live ranges of certain registers might be lengthened considerably, so we must update the live-state annotations of all suspensive instructions. We accomplish this with a standard backward live-variable analysis: using the codeblock structure, we compute the set of variables (symbolic registers) live at every instruction and use these to augment the existing live-state annotations of every suspensive instruction.

The defined-variables analysis removes redundant loads from memory, but it can also be used to remove redundant stores. If, after the analysis, a particular variable is never loaded from the frame because its definition and all uses occur in the same thread, then the defining store of the variable can also be removed. In turn, the frame slot, the reference, and the memory location for the variable can be eliminated as well. Unfortunately, there is a slight complication. If we remove references to some values, we must allow values to be passed directly to procedures and to be stored directly in data structures (frames, constructors, closures). The pHc compiler already includes support for this sort of code. It is a simple extension to the existing compilation rules, but it has not been shown here in the interest of clarity. Essentially, the code checks whether the contents of a register are a reference or a value using a special IsReference condition in the jump instruction. This sort of test is often needed only at the beginning of a thread. When combined with improved register allocation, it costs little (only one extra instruction per value per thread), but it reduces memory traffic significantly.

6.3.3 Task Structure Sharing Analysis

Each thread in the SMT machine is assigned a task structure. The task structure provides thread-private storage used to save live-state when a thread must suspend; it is used to build up deferred lists of threads waiting for a value in memory; and it is used to build up the global work queue of threads that are ready to execute. In this section, we describe how task structures are assigned to threads, and we show how the compiler tries to minimize the number of task structures allocated.

    assignTaskId :: Codeblock -> TaskId -> TaskId
    assignTaskId cb tid = case cb of
        Simple instructions ->
            let cb.taskId <- tid in tid
        Conditional predicate alternatives ->
            let tid'   <- assignTaskId predicate tid
                result <- tid'
            in  for cb' <- alternatives
                    tid''  <- assignTaskId cb' tid'
                    result <- if (tid' ≠ tid'') then tid'' else result
                finally result
        Parallel subblocks ->
            let (b:bs) <- subblocks
                tid'   <- assignTaskId b tid
            in  for cb' <- bs
                    tid' <- assignTaskId cb' (makeNewTid)
                finally tid'
        (Sequential || Threaded) subblocks ->
            for cb' <- subblocks
                tid <- assignTaskId cb' tid
            finally tid

Figure 6-5: Algorithm to Assign Task Structures to Threads

A naive approach would assign a separate task structure to each Simple codeblock in a procedure. This approach is correct, but it creates many more task structures than strictly necessary. Since task structures must be allocated and initialized on procedure invocation, having too many task structures leads to unnecessary overhead. A task structure can be shared between two threads if it is certain that the two threads will not execute concurrently. Not surprisingly, we can use control-flow analysis to find regions in the SMT code where this condition holds.

The analysis is shown in Figure 6-5 as the procedure assignTaskId. We assume that each Simple codeblock, the basic container of instructions in our control-flow abstraction, has a field taskId holding the location of its task structure in the activation frame. The goal of the analysis is to assign an identifier to each taskId field. The identifier represents a unique offset in the activation frame holding a pointer to a task structure; it is transformed into an index in the later stages of code generation. New identifiers are created using the makeNewTid routine, but the compiler tries to re-use an existing identifier as much as possible to reduce the number of task structures that will be allocated. The inputs to the assignTaskId procedure are a codeblock (cb) and an initial task structure identifier (tid). The output is a task structure identifier that can be used by codeblocks that do not execute concurrently with cb.

As in definedVars, the assignTaskId procedure performs a case analysis on the input codeblock. Simple codeblocks are the easiest: the codeblock's taskId field is set to the current task identifier, which is then returned.

Conditional codeblocks, on the other hand, are the trickiest. The input tid is used to process the predicate recursively. In turn, the resulting task identifier from the predicate is used to process each of the alternatives. If all the alternatives return the same task identifier as the predicate, then that task identifier is the final one for the Conditional. Otherwise, the last task identifier returned by an alternative that differs from that of the predicate is returned. The algorithm becomes tricky in the case of Conditional codeblocks because it attempts to reuse the task structure identifiers returned by the alternatives whenever possible.

In a Parallel codeblock, the first sub-block is processed recursively using the input tid. The remaining sub-blocks, however, are assigned fresh task structure identifiers since they might execute concurrently. The task structure identifier returned by the entire Parallel codeblock is that of the last sub-block, since control flows out of this sub-block to the rest of the program.

Sequential and Threaded codeblocks are processed together because they share similar control-flow patterns. The task structure identifier resulting from processing one sub-block is passed as the input to the next sub-block, and the identifier resulting from the last sub-block is returned for the entire codeblock.

6.4 Partitioning

So far, we have assumed the code generator was given a threaded program, and we have shown how to generate better SMT code from such a program. Now we discuss how a non-threaded A* program is transformed into a threaded one. The program analysis that drives this process is partitioning. In A* terms, partitioning involves combining several statements in a parallel letrec into a single thread.

The key to a correct partitioning is that data dependences among tasks are obeyed, regardless of context. Remember that by "context" we mean the dependences that can be induced on a procedure by its caller or by its callees. Intuitively, a correct partitioning requires that we do not place code that produces a value after code that consumes that value in the same thread. In many cases, the analysis is simple, as in the following function:

    easyPart = λ x. { a = b * x ;
                      b = x + 5
                      in a }

There is a certain dependence between a and b: a cannot be computed, under any context, before b is computed. The analysis is easy in this case because the expressions contain only strict operators like * and +. The conclusion is that it is safe for a and b to be combined into a single thread, as long as the bindings are reorganized according to data dependences.

We use '~' to specify this important relationship between statements. This threaded statement conjunction can be used to rewrite the example above without changing the value it produces or its termination properties:

    easyPart = λ x. { b = x + 5 ~
                      a = b * x
                      in a }

The real difficulty in partitioning is due to non-strictness. We noted that non-strictness introduces the notion of a context dependence, and these dependences are not evident by analyzing the code of a function in isolation. The mkPair function in Section 3.3 is a function that contains a context dependence. We repeat the code here, in A* syntax:

    mkPair = λ p q. { a = p * 2 ;
                      b = q + 3 ;
                      t = Pair(a,b)
                      in t }

Here, devoid of context information, a and b appear independent, so it would seem that they can be put in the same thread in any order, as in:

    mkPair = λ p q. { a = p * 2 ~
                      b = q + 3 ;
                      t = Pair(a,b) ~
                      in t }

To compute a, the code must get the value of p. If p is available, then the multiplication occurs and control flow continues directly to the code for b. But if p is not available, then the thread will suspend. Since the computation for b is in the same thread, following the suspension point, it is also effectively dependent on p. This is where the problem arises. In a context like:

    { x = mkPair y 9 ;
      y = Prj2 x }

the computation will deadlock, because it is the value of b in the body of mkPair that gets fed back into the function as argument p. In other words, the context introduces a dependence from a to b, so the code for a should execute after the code for b.

We could try switching the order of the bindings in the thread, putting b before a. But a context like

    { x = mkPair 8 y ;
      y = Prj1 x }

will simply reverse the situation. This context introduces a dependence from b to a in the body of the function and will also lead to a deadlock if b and a are placed in the same thread. The answer, of course, is that a and b have to be placed in separate threads, since the actual dependence is determined only at run-time.

The code for t, the constructor expression that allocates and initializes the storage for the pair, must also be placed in a separate thread. The non-strict semantics of the language say that mkPair can return a value even if the arguments have not been computed. This means that the function must return a value if that value does not depend strictly on the arguments. Constructors are non-strict, so even if p and q are never computed, the function must still return a tuple, albeit an empty one. The conclusion, then, is that the constructor expression cannot be placed in the same thread as any computation that may suspend.

The complexity in a partitioning algorithm arises from the need to preserve correct non-strict semantics while generating long, efficient threads. Non-strictness forces algorithms to be conservative whenever the possibility of a context dependence arises: the compiler must assume the context dependence exists even if it never arises at run-time. Effectively, this means that statements that might otherwise be placed in the same thread have to be separated.

To improve threading, some algorithms implement inter-procedural analyses that can often get around the problems of context dependences if all call sites are known statically. However, inter-procedural analysis is always at odds with a modular language because the compiler must process an entire program at once. In addition, inter-procedural analysis is much more complex in a higher-order language like pH, since functions can be passed as parameters or stored in data structures before they are applied.

6.5 Partitioning in pHc

We do not contribute any new partitioning results in this dissertation. Rather, we have concentrated on generating suspensive threads from partitioned code. The partitioning al- gorithm used in the current version of pHc is a simple intra-procedural demand-tolerance set algorithm. It was implemented by Jan-Willem Maessen and works directly on A* programs.

To produce a threaded program, the pHc compiler proceeds in two phases. First, the partitioning phase analyzes the original A* program and generates two data structures: the Same Thread relation and the AlwaysAfter map. Second, the threading phase uses the Same Thread and AlwaysAfter results to transform the original A* program into one that uses '~' statement conjunctions. The threaded program is then passed to the SMT code generator which, suitably modified as we have shown, produces better code. In this section, we discuss the meaning of the partitioning data structures in pHc.

The Same Thread relation divides the set of identifiers in a program into equivalence classes. Two identifiers are in the same equivalence class if the code to compute their values can be executed in a single thread. More precisely, the computations represented by the identifiers in a Same Thread equivalence class can be sorted and executed in dependence order, in any context, without creating a deadlock. Identifiers can represent function arguments, function results, or letrec statements.
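The representation of this relation is not specified here; as a purely illustrative sketch (all names below are hypothetical, not pHc's), an equivalence relation over numbered identifiers can be maintained with a standard union-find structure, merging classes as the partitioner proves that two identifiers may share a thread:

    #include <stdlib.h>

    /* Hypothetical union-find over numbered identifiers: two identifiers
     * end up in the same equivalence class exactly when the partitioner
     * has proved that their computations may share a thread. */
    typedef struct { int *parent; int n; } SameThread;

    SameThread *stNew(int n) {
        SameThread *st = malloc(sizeof *st);
        st->parent = malloc(n * sizeof *st->parent);
        st->n = n;
        for (int i = 0; i < n; i++)
            st->parent[i] = i;          /* every identifier starts alone */
        return st;
    }

    int stFind(SameThread *st, int id) {    /* class representative */
        while (st->parent[id] != id) {
            st->parent[id] = st->parent[st->parent[id]];  /* path halving */
            id = st->parent[id];
        }
        return id;
    }

    /* Called when identifiers a and b are shown to be thread-compatible. */
    void stUnion(SameThread *st, int a, int b) {
        st->parent[stFind(st, a)] = stFind(st, b);
    }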

Here is the Same Thread relation produced by the partitioning algorithm for the function easyPart. It consists of a single equivalence class:

    { a, b, easyPart, x }

The identifiers a and b correspond to the left-hand sides of the bindings in the function. The identifier x is the function argument, and the identifier easyPart names the result of the function, effectively the in expression of the letrec block. The Same Thread relation confirms our intuition that all the computations in easyPart can be placed into a single thread.

The second data structure produced by the partitioning analysis is the AlwaysAfter map. The information it captures about a program is more difficult to understand than the Same Thread relation, but it is still quite useful. The AlwaysAfter map maps an identifier to the set of identifiers whose computations can always precede it. Intuitively, this seems to express the same information as the Same Thread relation, but in fact it does not. For the purposes of grouping statements into threads, the information in the AlwaysAfter map is weaker than that in the Same Thread relation. If the compiler tried to form threads directly from just the AlwaysAfter map, the resulting threads would generally be smaller than those given by the Same Thread relation, because computations that share no dependences could not be placed in the same thread.

The AlwaysAfter map essentially captures data dependences between identifiers, even across threads. In particular, if the AlwaysAfter entry for identifier x is the set { x, y, z } (an identifier always forms part of its own set), then x cannot be computed until y and z have been computed. However, this does not imply that x, y, and z can be computed in the same thread: the values for y and z might have to be computed in separate threads based on their own sets of dependences. What the entry for x does say is that it is reasonable to wait until y and z have been computed before spawning the thread that computes x. If spawned earlier, the code for x would suspend until y and z were available.

Still, the AlwaysAfter map and the Same Thread relation are consistent in the following way. If y and z happened to be in the same equivalence class according to the Same Thread relation, then it is certain that x would also be in the same equivalence class. Here is the AlwaysAfter map for easyPart:

    a        ↦ { a, b, x }
    b        ↦ { b, x }
    easyPart ↦ { a, b, easyPart, x }
    x        ↦ { x }

In this case, we do not get any additional information beyond that contained in Same Thread since all statements in easyPart can be placed in the same thread.

6.6 Partitioning fib

The partitioning analysis in the pHc compiler yields the following Same Thread relation for the original, non-threaded version of fib:

    { p, t, x, x1, x2 }
    { f1, f2, r', r, fib }

Thus, fib can be split into two threads:

1. { p, t, x, x1, x2 }: The variables x, p, and t together determine the discriminant of the Case expression. The Case is strict in the value of the discriminant, so it is also strict in the value of these variables. Consequently, these three can be placed in the same thread. Furthermore, computations in the branches of the Case can also be placed in the same thread. This means that control flow can proceed from the discriminant to the branches without breaking the current thread. The statements for x1 and x2 can be placed in the same thread as the discriminant because they share a strict dependence on x with the discriminant.

2. { f1, f2, r', r, fib }: The current partitioner in pHc lacks the sophisticated analysis of non-strict, recursive function calls needed to determine what otherwise seems obvious: that the whole of fib can be placed in a single thread. Instead, the partitioner takes the conservative approach of creating a new thread for f1, for f2, and for those values that are strictly dependent on the results of these two function calls. In fact, r', r, and fib are simply different names for the result of the function. This result is strict in f1 and f2 due to the primitive '+' used to compute r'. A more powerful algorithm would be able to place all of fib into a single thread.

The AlwaysAfter map for fib is:

    p   ↦ {p, x}
    t   ↦ {p, t, x}
    x1  ↦ {p, t, x, x1}
    x2  ↦ {p, t, x, x2}
    f1  ↦ {f1, p, t, x}
    f2  ↦ {f2, p, t, x}
    r'  ↦ {f1, f2, p, r', t, x}
    r   ↦ {f1, f2, p, r, r', t, x}
    fib ↦ {f1, f2, fib, p, r, r', t, x}

The map shows the strict data dependences among different computations, even across Same Thread threads. For instance, notice that r is always after t, since t is the discriminant of the Case expression for r; but r and t are not in the same thread according to the Same Thread relation. Below, we show how to use this information to combine these two bindings into a single thread. Also notice that, due to non-strictness, x1 does not appear in the entry for f1, nor x2 in the entry for f2. Because the current partitioning algorithm does not work across function calls, it cannot make any assumptions about the strictness of arguments at any call. The conservative decision, then, is to place the function and its arguments in separate threads.

6.7 Threading fib

The partitioning analysis reveals what computations can be serialized. The threading algorithm, in turn, makes use of the partitioning results to transform the original A* program into one that uses '~' instead of ';'. For instance, with only the Same Thread relation, fib can be threaded as follows:

    fib = λ x. { p = x >= 2 ~
                 t = Prj0(p) ;
                 r = Case(t, x,
                       { x1 = x - 1 ~
                         x2 = x - 2 ;
                         f1 = fib x1 ~
                         f2 = fib x2 ~
                         r' = f1 + f2
                         in r' } ) ~
                 in r }

The algorithm to produce this new version of fib is the following:

Threading Algorithm for the Same Thread Relation

1. In each parallel letrec block, group the statements that belong in the same thread.

2. Within a group of statements, order them according to data dependences (a topological sort; see the sketch after this list).

3. Within a group, replace parallel statement conjunctions (;) with threaded statement conjunctions (~). Across groups, the ';' remain unchanged.

4. Arrange the groups of statements so that the one containing the in expression is last in the block. In the example above, the in expression in the outermost block is called fib, and in the innermost block, it is called r.
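Step 2 is an ordinary topological sort of a group's statements by their data dependences. The following sketch illustrates the idea with Kahn's algorithm over an explicit dependence matrix; it is an illustration only, and the data structures (dep, order) are hypothetical rather than pHc's:

    #include <stdlib.h>

    /* Hypothetical illustration of step 2: Kahn's algorithm over n
     * statements.  dep[i][j] != 0 means statement j depends on (must
     * follow) statement i.  Writes a dependence-respecting order into
     * order[] and returns 0, or returns -1 if the dependences are cyclic. */
    static int topoSort(int n, int **dep, int *order) {
        int *indegree = calloc(n, sizeof *indegree);
        int *queue = malloc(n * sizeof *queue);
        int head = 0, tail = 0, emitted = 0;

        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (dep[i][j]) indegree[j]++;

        for (int i = 0; i < n; i++)           /* seed with the roots */
            if (indegree[i] == 0) queue[tail++] = i;

        while (head < tail) {
            int i = queue[head++];
            order[emitted++] = i;
            for (int j = 0; j < n; j++)       /* release successors */
                if (dep[i][j] && --indegree[j] == 0) queue[tail++] = j;
        }
        free(indegree); free(queue);
        return emitted == n ? 0 : -1;         /* -1: cycle detected */
    }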

In the outermost letrec of fib, the statement groups are {p,t} and {r,fib}. We cannot group all four statements together because the function calls in the expression for r overwhelm the capabilities of the simple partitioner. Likewise for the inner letrec: there, the groups are {x1,x2} and {f1,f2,r',r}. Clearly, the algorithm has created longer threads by combining bindings, but the results are not quite satisfactory. Despite the shortcomings of our current partitioner, our aim is to squeeze fib into a single thread. To improve the partitioning, we make a second pass over the code using the information in the AlwaysAfter map.

This second pass is more complicated than the first, so we illustrate its behavior. Figure 6-6(a) shows the results of using AlwaysAfter on fib. The key observation is that this additional information allows us to combine the code for p, t, r, and fib into a single thread. This is certainly an improvement over using just the Same Thread relation. Because fib is a fairly simple example, we also show the behavior of the algorithm on a more complex, though artificial, example in Figure 6-6(b).

The algorithm for the second pass operates on an AlwaysAfter graph built from the statement groups found in each letrec block using the Same Thread relation. The AlwaysAfter graph is a directed graph where the nodes correspond to the statement groups in a letrec block and the edges correspond to the entries in the AlwaysAfter map. For example, in the graph for fib, we place an edge from the node representing {p,t} to the node for {r,fib}, since at least one of the statements of the second node is always after at least one of the statements in the first node. In this particular case, the relationship is much stronger, though it need not be, since both r and fib are always after both p and t. In the graph for the inner letrec, we insert no edges, since no similar relation exists between any pair of statements in different nodes.

[Figure 6-6 artwork: part (a) shows the AlwaysAfter graphs for the outer letrec of fib (a single edge from {p,t} to {r,fib}) and the inner letrec (no edges), each annotated with BFS order and last parents; part (b) shows the same stages for a larger, contrived graph, ending with the topological order and the resulting groups.]

Figure 6-6: Threading Using the AlwaysAfter Map. Part (a) illustrates the behavior of the algorithm on fib. Part (b) illustrates its behavior on a more complex, though artificially contrived, input graph.

After building an initial AlwaysAfter graph, the next task is to coalesce nodes to produce new statement groups. Compared to the first pass using only the Same Thread map, the coalescing algorithm should produce fewer groups but more statements per group (i.e., longer threads). It proceeds as follows:

Threading Algorithm Using the AlwaysAfter Map

1. Traverse the AlwaysAfter graph in breadth-first order, labeling each node with a timestamp indicating when it was first encountered.

2. Define the last parent of a node n to be the unique node p such that there exists an edge (p, n) in the AlwaysAfter graph and p was visited latest in the breadth-first traversal. If a node is a root of the AlwaysAfter graph, then it serves as its own last parent.

3. Traverse the nodes again, but in topological order. For each node n, add it to the group containing its last parent p only if p was the last node added to the group. If p was not the last node added to the group, then form a new group initially containing only n.

To generate the new program, we order the statements in each new group according to data dependences, join them with '~', and separate the groups with ';'. A sketch of the timestamp and last-parent computation follows.
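The following sketch illustrates steps 1 and 2 on an explicit adjacency matrix: a breadth-first traversal assigns timestamps, and each node's last parent is its latest-stamped predecessor. The representation (adj, roots, lastParent) is hypothetical; pHc's actual data structures are not shown here.

    #include <stdlib.h>

    /* Hypothetical steps 1 and 2 of the AlwaysAfter threading pass.
     * adj[i][j] != 0 means there is an edge (i, j); roots[] lists the
     * nodes with no parents.  Fills lastParent[] for every node. */
    static void lastParents(int n, int **adj,
                            const int *roots, int nroots,
                            int *lastParent) {
        int *stamp = malloc(n * sizeof *stamp);
        int *queue = malloc(n * sizeof *queue);
        int head = 0, tail = 0, clock = 0;

        for (int i = 0; i < n; i++) stamp[i] = -1;
        for (int k = 0; k < nroots; k++) {    /* roots are their own parent */
            queue[tail++] = roots[k];
            stamp[roots[k]] = clock++;
            lastParent[roots[k]] = roots[k];
        }
        while (head < tail) {                 /* step 1: BFS timestamps */
            int i = queue[head++];
            for (int j = 0; j < n; j++)
                if (adj[i][j] && stamp[j] < 0) {
                    stamp[j] = clock++;
                    queue[tail++] = j;
                }
        }
        for (int j = 0; j < n; j++) {         /* step 2: latest-stamped parent */
            int best = -1;
            for (int i = 0; i < n; i++)
                if (adj[i][j] && (best < 0 || stamp[i] > stamp[best]))
                    best = i;
            if (best >= 0) lastParent[j] = best;
        }
        free(stamp); free(queue);
    }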

The effect of the AlwaysAfter algorithm is to link chains of statement groups that are safe to execute in order. The last parent serves two purposes. First, it marks the longest path to a node in the graph, since it was computed based on information gathered from a breadth-first traversal; we want to add nodes to the longest possible chains to maximize the size of threads. Second, it prevents multiple unrelated children of one parent from being added to the same group. If nodes b and c are children of node a but have no edges between them, it is not always safe to place a, b, and c in the same thread.

The statement groups generated in the second pass with the AlwaysAfter map produce better code for fib:

    fib = λ x. { p = x >= 2 ~
                 t = Prj0(p) ~
                 r = Case(t, x,
                       { x1 = x - 1 ~
                         x2 = x - 2 ;
                         f1 = fib x1 ~
                         f2 = fib x2 ~
                         r' = f1 + f2
                         in r' } ) ~
                 in r }

Now all the bindings in the outermost letrec block are in the same thread. This makes a lot of sense, because the Case expression for r must always wait for t to produce a value.

Still, there is one remaining ';' in the inner letrec block. A third pass over the code uses available-identifier information to find threads that are guaranteed not to suspend at run-time. The thread consisting of x1 and x2 is an example of such a thread. The compiler can determine that by the time control flow reaches the code to compute x1 and x2, all values on which these statements depend will have been computed. In this case, the only data dependence is on x, and since x must be computed for the discriminant of the Case, its value will be available in the branches of the Case. Thus, the third pass can safely push the bindings for x1 and x2 into the next thread. The final version of fib consists of just one thread, as we expected:

    fib = λ x. { p = x >= 2 ~
                 t = Prj0(p) ~
                 r = Case(t, x,
                       { x1 = x - 1 ~
                         x2 = x - 2 ~
                         f1 = fib x1 ~
                         f2 = fib x2 ~
                         r' = f1 + f2
                         in r' } ) ~
                 in r }

Similarly, if the last thread in a block is non-suspensive, then it is combined with the preceding thread in the block.

This third pass of the threading algorithm might seem a little suspicious. Why is the information needed for this transformation not expressed directly in the Same Thread map? In this pass, we are actually using information beyond that considered by the partitioner to remove extraneous threads. Specifically, we are using information about control flow in conditionals. In our compilation strategy, the code for Case expressions is structured so that control flows directly from the predicate (discriminant) to the alternatives without any intervening scheduling operation. This means that we can be certain that the values required to compute the predicate are available in each alternative. Case expressions occur often enough in A* that this third pass becomes an important part of the threading process.

In summary, the partitioning analysis reveals what statements can be grouped together, but it is the threading algorithm that transforms the original A* code into a threaded program. The threading algorithm consists of three passes. The first pass uses only the Same Thread map to perform the basic grouping of statements into threads, replacing ';' with '~' for threaded statements. The second pass considers cross-thread dependences in the AlwaysAfter map: it traverses an AlwaysAfter graph to discover chains of threads that can be combined into even longer threads. A final cleanup pass makes sure that any thread without a suspensive operation is combined with the thread preceding or following it. Non-suspensive code has trivial non-strict semantics, so such combinations are always safe.

6.8 When Programs Become Too Serial

Using partitioning and threading, we were able to generate much better code for fib than with the naive compilation strategy of Chapter 5. Unfortunately, the code has lost all parallelism. This situation arises because the only metric used to judge the quality of a thread is length: to the compiler, longer threads are always better than shorter threads, and in some cases the partitioning and threading process will yield a single thread for a procedure. In previous pH (Id) compilation efforts using non-suspensive threads, this situation rarely arose because it is almost impossible to generate a single non-suspensive thread for an entire function.

To fix the lack of parallelism, we insert a pass after partitioning and threading that splits a single thread containing multiple function calls (application expressions) into two threads, each with roughly half the function calls of the original. Each thread will then be spawned separately at run-time, and each set of applications will have the chance to run in parallel. In a doubly recursive program like fib, this change is enough to expose all useful parallelism. This transformation is also safe according to the semantics of A*, since we can always add parallelism to a letrec block that has been partitioned.

The code for the re-parallelized fib follows. We also include the new annotations for available identifiers. These must change, since the order in which statements execute cannot be predicted exactly at compile time once the parallelism is put back in.

    fib = λ x. { p = x >= 2          ~ {}
                 t = Prj0(p)         ~ {p,x}
                 r = Case(t,           {p,t,x}
                       x,              {p,t,x}
                       { x1 = x - 1    ~ {p,t,x}
                         x2 = x - 2    ~ {p,t,x,x1}
                         f1 = fib x1   ; {p,t,x,x1,x2}
                         f2 = fib x2   ~ {p,t,x}
                         r' = f1 + f2    {f2,p,t,x}
                         in r' } )       {f2,p,r',t,x}
                 in r }                  {p,r,t,x}

We split the code between the two recursive calls to fib. This means that the computations of f1 and f2 can run in parallel, and this is exactly what we want to happen to parallelize fib.

When we produce SMT code using the new available-identifiers information, we must also insert additional touch instructions that were not needed in the single-thread code. For instance, the compilation of the expression for r', now in the same thread with only f2, must check that f1 is available. This touch instruction effectively synchronizes the two parallel activities for the recursive calls to fib.

    f1 + f2:

        touch(@rF[f1])
        load(r1, @rF[f1])
        load(r2, @rF[f2])
        plus(rV, r1, r2)

The parallelization algorithm can be refined further to include a cost metric for procedure calls. Some calls are "cheap" because they perform little work and should execute within a thread. But "expensive" calls, like the recursive calls in fib, tend to contain lots of work and should be allowed to execute in parallel. The relative cost of calls can help determine a better place (other than the default "middle") to split threads. We have not implemented this refinement in pHc.

6.9 Incremental Spawning

Up to now we have taken for granted the incremental spawning of threads in letrec expressions, but in fact this order was chosen deliberately. In our earlier work [6], all statements in a letrec were spawned at the beginning of the letrec, and control then flowed immediately into the in expression. The statements were assumed to be scheduled in some unspecified order, but dynamic synchronization made certain that data dependences were obeyed. Though a perfectly correct strategy in theory, it is a poor choice in practice.

The incremental strategy is a much better complement to the work-stealing scheduler implemented in the pHc run-time system, for two reasons. The first advantage is that, assuming stealing and suspensions are infrequent, it makes it possible to run a pH program in a mostly serial order: statements can execute from top to bottom, left to right, just as in C or Fortran. This predictable order makes optimization based on control flow across threads possible. For example, the scheduler recognizes and optimizes the common case where the next thread scheduled is the last thread spawned. The second advantage is that when a steal does occur, the amount of work stolen is substantial: the work consists not just of a single statement but, conceptually, of all statements following it. This tends to decrease the total number of steals in an execution of a program. We discuss the scheduler in more detail in Section 7.2.

The scheduling strategy is not explicit in the SMT machine model because we wanted to allow implementations the freedom to experiment with different strategies. It might seem that we are undermining the SMT abstraction by specializing the SMT code for the work-stealing scheduler in the pHc RTS. However, this approach turns out to have a minor effect on the structure of the SMT code and a major impact on the efficiency of the executable code.

6.10 Summary

This chapter described how the compiler accomplishes one of its most important tasks: converting a pH program into long suspensive threads. Long threads are better than short threads because they use processor resources more effectively and because the fixed cost of scheduling can be amortized over more instructions. In addition, suspensive threads can encompass rich control flow and embody an optimistic synchronization strategy, features that are useful for mostly-strict programs on an SMP.

To make threads as long as possible, the compiler implements a partitioning analysis and a threading transformation, which take short threads and merge them into long threads. In the context of A*, this amounts to restructuring statements in letrec blocks and replacing certain parallel statement conjunctions (;) with threaded conjunctions (~). Undoubtedly, there is opportunity to improve both the partitioning and the threading phases of the compiler in the future, but even with a simple example like the fib function, it is clear that the existing algorithms provide a major improvement in the efficiency of the code. Sometimes programs get transformed into a single thread, removing all parallelism, but these can be re-parallelized easily thanks to the original parallel semantics of the language.

Chapter 7

Executable Code Generation

This chapter describes how the pHc compiler generates executable code from SMT programs. It marks the transition in the compiler from our own SMT abstraction to the abstractions offered by real systems. Much of the discussion focuses on low-level issues that must be addressed to implement pH, and we assume the reader is familiar with the C language and concepts like processes, address spaces, and memory models. The problem of generating executable code consists of two parts. The first concerns the implementation of the three SMT abstractions (processors, synchronizing memory, and the global work-queue) on a real machine. The implementation of these constitutes the core of the pH Run-Time System and is shared by all programs compiled with pHc. In implementing the RTS, we have to consider issues such as how to exploit parallelism in a real machine, what constitutes an efficient scheduling algorithm, how to represent data in pH programs, and how to manage memory automatically.

The second part of the problem is how to emit code that implements the SMT instructions of a program. This part constitutes the rest of the back-end of the pHc compiler. We review the different options available for executable code generation and then provide translations to executable code for each SMT instruction.

The last section in this chapter addresses the issues involved in re-targeting pHc to a new architecture. The first version of pHc was designed for Sun UltraSPARC SMPs. We chose Sun machines because they are readily available at MIT and because they are designed to scale up to 64 processors, thus allowing testing of the compiled code under a variety of configurations. We spent significant effort, however, making sure that the compiler was portable.

Large portions of the pH Run-Time System described here, including the garbage-collection and scheduling algorithms, are the work of Jan-Willem Maessen, the other member of the pHc team.

7.1 Implementing SMT Processors

In the SMT machine, a processor consists of a register file, a scratch memory, and a mechanism to schedule and execute threads. In the pHc implementation, an SMT processor is not mapped directly to a machine processor. Rather, it is realized by a C procedure and by a data structure containing the local state of the processor. Together, we refer to the procedure and its attendant state as an SMT virtual processor (SVP). The following sections describe the design of SVPs and show their implementation in C.

A note about terminology: whenever we refer to a real processor, like the UltraSPARC, we use the term CPU (central processing unit). This makes it easy to keep virtual processors, SVPs, separate from real processors, CPUs, in the discussion.

7.1.1 Exploiting Parallelism in an SMP

In order to exploit parallelism in a real machine, several SVPs must be scheduled concurrently on different CPUs by the operating system. Modern multiprocessor operating systems provide the kernel thread as the single object that is scheduled by the system. The classic notion of a process is implemented today by a pair of objects: a protected addressing context, representing the virtual address translations for the code, and a default kernel thread. Additional kernel threads must be created explicitly. Therefore, a program must contain several kernel threads in order to run in parallel on an SMP. There are two ways to organize a program with multiple kernel threads: a program can consist of multiple processes, each with a single thread, or it can consist of a single process with multiple threads (we ignore the multiple-processes/multiple-threads case). Each of these structures has its own advantages and disadvantages.

A program organized as multiple processes, where the default kernel thread in each process implements a single SVP, is perhaps more reliable than a single process with multiple kernel threads. Single-thread processes are well-understood and thoroughly tested abstractions in all operating systems. Such a program also contains multiple addressing contexts useful for protection, so that the internal state of an SVP cannot be corrupted by other SVPs. Unfortunately, multiple addressing contexts are also a disadvantage when implementing pH. Our compilation model requires shared memory, and we rely on the fact that pointers can be passed from one SVP to another. Sharing a run-time heap is not too difficult using standard shared-memory APIs, but sharing statically-specified (but mutable) program data is cumbersome.

Instead, we implement a parallel pH program as a single process with multiple kernel threads. This structure guarantees that all memory is shared by default as there is only a single addressing context for the program. We chose the standard POSIX pthreads package to improve the portability of our compiler across different operating systems, even though pthreads has well-known deficiencies. One particularly glaring POSIX problem arises in the code for the garbage collector. When one thread cannot allocate enough space, a garbage collection must be triggered. Ideally, this involves suspending all other threads immediately, performing the collection, and then restarting all threads. A POSIX thread, however, cannot be forced to suspend in mid-execution by some other thread. A POSIX thread can only suspend by waiting on a synchronization variable. This aspect of the POSIX specification tends to make synchronization in the garbage collector more complex.
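To make the complication concrete, here is a sketch of the sort of cooperative rendezvous this restriction forces: since one POSIX thread cannot suspend another, every SVP must poll a flag at safe points (for example, on allocation) and park itself until the collection completes. This is an illustration under our own assumptions (names such as gcSafePoint and numSVPs are invented), not the actual pHc synchronization code.

    #include <pthread.h>

    /* Hypothetical cooperative stop-the-world rendezvous. */
    static pthread_mutex_t gcLock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  gcDone = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  gcAllStopped = PTHREAD_COND_INITIALIZER;
    static int gcRequested = 0, stoppedCount = 0;
    static int numSVPs = 8;                 /* set at startup in reality */

    void gcSafePoint(void) {                /* called by every mutator SVP */
        pthread_mutex_lock(&gcLock);
        if (gcRequested) {
            if (++stoppedCount == numSVPs - 1)
                pthread_cond_signal(&gcAllStopped);
            while (gcRequested)             /* park until the GC finishes */
                pthread_cond_wait(&gcDone, &gcLock);
            stoppedCount--;
        }
        pthread_mutex_unlock(&gcLock);
    }

    void gcCollect(void) {                  /* called by the triggering SVP */
        pthread_mutex_lock(&gcLock);
        gcRequested = 1;
        while (stoppedCount < numSVPs - 1)  /* wait for all other SVPs */
            pthread_cond_wait(&gcAllStopped, &gcLock);
        /* ... trace and copy the heap here ... */
        gcRequested = 0;
        pthread_cond_broadcast(&gcDone);
        pthread_mutex_unlock(&gcLock);
    }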

7.1.2 SVP Internal State

The state of an SVP is defined by the perThreadInfo structure shown in Figure 7-1(a). The state accessible from the SMT instruction set consists of the first two fields of the structure. The first field describes the registers of the processor in the SMT model. These are implemented as an array of pHob elements. The type pHob is used to represent pH scalars and is described in Section 7.3. In practice, most of these SVP registers will be mapped to CPU registers by the code generator, but they must also exist as separate entities since several SVPs might time-share a single CPU during execution of a program.

The remaining declarations concern the state of the processor that is not directly accessible to the instructions. These declarations are explained throughout the rest of this chapter.

    typedef struct {
        /* State accessible to instruction set */
        pHob virtualregisters[NUMBER_OF_VIRTUAL_REGISTERS];
        pHob scratcharea[SCRATCH_SIZE];

        /* State to implement the work-queue abstraction */
        Int32  localQueueLock;
        pHptr  externalWork;
        UInt32 randomNext;

        /* State to implement the memory manager */
        pHptr chunkBot, chunkHP;
        chunkInfo *toSpace, *allocChunk, *fromSpace;
        size_t ToUsed;
        Int32 done_by_me;
        Int32 threadNum;
        threaddata thisthread;
    } perThreadInfo;

    (a) SVP Internal State

    static Void *rts_invoke_thread(Void *state)
    {
        localState = (perThreadInfo *) state;
        rts_run_thread();
    }

    (b) SVP Code

    {
        Int32 j = 0;
        for (; j < numSVPs; j++)
            threadGCInit( states[j] );

        for (j = 1; j < numSVPs; j++)
            thread_start( ..., &rts_invoke_thread, states[j], ... );
    }

    (c) SVP Instantiation in the Run-Time System

Figure 7-1: SMT Virtual Processor Implementation

7.1.3 SVP Procedure

The code for an SVP is shown in Figure 7-1(b). This routine is designed to execute in its own POSIX (kernel) thread. It is passed a pre-initialized perThreadInfo structure, which it assigns to the variable localState. The routine then invokes rts_run_thread (not shown here) to grab an initial SMT thread from the global work queue and begin executing. The effect of rts_run_thread is similar to that of issuing an SMT schedule instruction. Once an initial thread starts executing on a processor, it becomes responsible for scheduling the next thread on that processor via a schedule instruction. Threads maintain this self-scheduling behavior until the program terminates.

The semantics of the variable localState are unusual and merit additional discussion. This variable has what we refer to as "thread-local" scope: though it appears to be a regular global variable, its value is actually different within (local to) each kernel thread of the program. The C programming language used to implement the pH RTS does not offer thread-local variables. It is limited to global variables, shared by all threads in a program, and procedure-local variables, accessible only within a procedure activation.

Thread-local variables are quite useful in parallel programming, so libraries like POSIX offer facilities to implement them. Unfortunately, the POSIX implementation is clumsy and slow.
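The facility in question is the POSIX key mechanism. A minimal sketch (the wrapper names below are hypothetical) shows why it is clumsy for our purposes: every access to the "thread-local variable" becomes a library call.

    #include <pthread.h>

    /* Hypothetical thread-local state via POSIX keys.  Every access to
     * the "variable" goes through pthread_getspecific, a function call,
     * instead of a register read. */
    static pthread_key_t localStateKey;

    void rtsInitKeys(void) {
        pthread_key_create(&localStateKey, NULL);   /* once, at startup */
    }

    void rtsBindState(void *state) {
        pthread_setspecific(localStateKey, state);  /* once per kernel thread */
    }

    void *rtsLocalState(void) {
        return pthread_getspecific(localStateKey);  /* on every access */
    }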

Our solution is to exploit the thread-local state implicit in kernel threads. A kernel thread virtualizes the underlying CPU, so each kernel thread has its own register state. We implement localState by mapping it always to a CPU register, and we prevent any other code from using that register. Using the GNU C compiler gcc, this is accomplished with the following definition for localState:

    register perThreadInfo *localState __asm__ ("register-name");

Gcc makes sure to use the given register in any code where this definition is valid. For compatibility with other library code, it is wise to choose a register specified as callee-saved by the system calling conventions.

As noted by the implementors of Mercury [24], the use of register windows in the SPARC calling conventions makes it all but impossible to place a thread-local variable in a register. SPARC registers are divided into four groups: the global registers g0-g7, the output registers o0-o7 used to pass procedure arguments, the input registers i0-i7 used to receive procedure arguments, and the local registers l0-l7 used for local variables. On a function call, the register window hardware in the SPARC remaps the output registers of the caller to the input registers of the callee and allocates a fresh set of local and output registers. The global registers remain unchanged between caller and callee.

A thread-local variable cannot be placed in an output register since these are effectively caller-saved. Local and input registers are not useful either, since we want thread-local variables to be visible across function calls into the pH RTS. Even global registers are problematic. None of these registers is specified as callee-saved: register g0 is always zero, g1-g4 are caller-saved, and g5-g7 are reserved by the operating system and apparently are used for multiplication, division, and modulus routines on older SPARCs. We also observed that register g7 is initialized and used by the POSIX library on SPARCs.

After lots of experimentation, we resorted to another of the useful features of gcc: the ability to compile SPARC code that does not use register windows but that remains compatible with the standard calling conventions. This code-generation option is specified using the -mflat flag to gcc. Using the flat register model, we were able to specify any callee-saved register for localState. We chose l0, as shown in the following declaration:

    register perThreadInfo *localState __asm__ ("l0");

With this declaration, every kernel thread effectively gets its own version of the variable localState.

7.1.4 Instantiating the SVPs

The fragment of code in Figure 7-1(c) is responsible for instantiating the SVPs at the beginning of a program. The variable numSVPs is specified by a command-line argument or is determined automatically based on the number of processors in the system. Currently, we create at most one SVP per CPU. The thread_start routine is a wrapper around the POSIX pthread_create function. It creates a new kernel thread and executes rts_invoke_thread, with parameter states[j], in the new kernel thread. The parameter is a perThreadInfo data structure that has been previously initialized.

It is important to make clear the differences between kernel threads and SMT threads. Kernel threads are used to implement SVPs, and SVPs in turn are used to execute SMT threads. Only a small, fixed number of kernel threads are created during the execution of a pH program; in contrast, thousands or millions of SMT threads might be spawned as a program runs. Though "lightweight" by operating system standards, kernel threads, each of whose state includes at least all CPU registers and a stack, are massive compared to the gossamer-like SMT threads. The context-switch time between SMT threads is, consequently, much faster than between kernel threads. Finally, kernel threads are scheduled by the operating system, while SMT threads are self-scheduled without any operating system intervention.

7.2 Implementing the Global Work-Queue

The second major abstraction in the SMT machine is the global work-queue. Threads that are ready to run are placed in the global work-queue by one of three instructions: a spawn instruction; an istore instruction updating a memory cell with a deferred list; or a decbar instruction discharging a barrier on which a touchbar had previously executed. A thread is removed from the work-queue by a schedule instruction. The processor that eventually schedules a thread is not necessarily the same processor that spawned it; in fact, a thread that suspends many times may be resumed on a different processor each time. In this way, parallelism in the SMT machine is completely symmetric.

The design of this portion of the pH RTS was influenced heavily by Cilk [12]. We discuss this influence in more detail in Section 9.3.

7.2.1 Randomized Work-Stealing

In practice, exploiting locality is often the key to achieving good performance, and our implementation distributes the work queue across all SVPs for this purpose. Conceptually, the idea of a single, global queue is preserved, but during execution, most threads are executed by the SVP that spawned them. Each SVP has its own local queue, and threads enabled by instructions executing on a particular SVP are always placed into its local queue. When a schedule instruction executes, a thread is dequeued from the head of the local queue according to a last-in-first-out (LIFO) policy. If the queue is empty, the processor must retrieve work elsewhere. This is the basis for the work-stealing scheduler in the pH run-time system: if the local queue is empty, the current SVP becomes a thief and steals a thread from the queue of some other SVP, the victim. The victim is chosen randomly from all other SVPs, and the thread is taken from the tail of the victim's local work queue. In other words, the thread stolen is the oldest thread in the victim's local queue.

Figure 7-2 shows most of the code used in the pH RTS to steal work. We have elided the code to traverse the queue itself, but this is fairly straightforward. The code in the figure consists of two nested while loops. The outer loop is an infinite loop, so that the steal process will not terminate until it finds work or until the program exits. The inner loop forces the SVP to pause every so often: if the SVP does not find work after trying, on average, twice for every processor, then it executes an operating-system yield call. On a yield, the SVP gives up the CPU and allows the operating system to schedule another kernel thread. The SVP will be rescheduled later and will resume its attempts to steal work. This is the simplest possible back-off scheme. Other systems, like pHluid and Cilk, use a back-off scheme where the waiting time increases proportionally to the number of times a processor has failed to acquire new work.

The core of the steal code is the critical region within the inner while loop. This part of the code accounts for the fields localQueueLock, externalWork, and randomNext in the perThreadInfo data structure shown in Figure 7-1(a). The code computes a random victimId, a number between 1 and the number of SVPs, and uses it to index into a global array of all SVP states. The randomNext field holds the state of a random-number generator for each SVP. This is necessary because it is difficult to find a portable random-number generator in Unix that is also safe in a multithreaded environment.
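The generator itself is not shown in the figure; a typical solution, sketched here with invented names (this version of randomVictimId takes more parameters than the one in Figure 7-2), is a small per-SVP linear congruential generator whose entire state lives in the randomNext word, so no locking is needed:

    typedef unsigned int UInt32;

    /* Hypothetical per-SVP random victim selection.  The state lives in
     * the SVP's own randomNext word, so it is multithread-safe without
     * locks.  The constants are those of the C standard's example rand()
     * implementation. */
    static UInt32 nextRandom(UInt32 *randomNext) {
        *randomNext = *randomNext * 1103515245u + 12345u;
        return (*randomNext >> 16) & 0x7fff;
    }

    /* Pick a victim in [0, numSVPs) that is never the caller itself. */
    static UInt32 randomVictimId(UInt32 myId, UInt32 numSVPs,
                                 UInt32 *randomNext) {
        UInt32 v = nextRandom(randomNext) % (numSVPs - 1);
        return (v >= myId) ? v + 1 : v;   /* skip over ourselves */
    }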

With the state of a possible victim in hand, the code then attempts to acquire the localQueueLock protecting the local queue of the victim. The lock is essential because the victim and one or more thieves may be trying to access the victim's local queue at the same time. For performance, the lock is implemented directly in the RTS using atomic memory operations, not POSIX lock objects. This approach is about four times faster than using POSIX. After acquiring the lock, the thief SVP is free to traverse and modify the victim's local work queue.
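The definitions of tryQueueLock and releaseQueueLock are elided in Figure 7-2. One plausible shape for them, sketched here with the GCC atomic builtins as a modern stand-in (on the UltraSPARC this would be the ldstub or cas instruction), is a simple test-and-set spinlock; this is an illustration, not the actual RTS code:

    typedef int Int32;

    /* Hypothetical test-and-set queue lock.  __sync_lock_test_and_set
     * atomically stores 1 and returns the old value, so the lock is
     * acquired exactly when the old value was 0. */
    static int tryQueueLock(volatile Int32 *lock) {
        return __sync_lock_test_and_set(lock, 1) == 0;
    }

    static void releaseQueueLock(volatile Int32 *lock) {
        __sync_lock_release(lock);    /* atomically stores 0 */
    }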

The work queue is accessed by a thief via the externalWork variable. It is simply a linked list of task structures. In a typical implementation, the code elided in Figure 7-2 would traverse the linked list and would extract the last element, if any, as the stolen work. However, our scheduler is slightly more complex. It includes a provision whereby task data structures can be marked as stealable or local. Stealable tasks are those judged by the compiler to contain enough work to be worth executing on another processor. Local threads contain little work, and the overhead of executing them elsewhere would outweigh any advantage. Thus, the traversal of the victim's queue actually searches for the oldest enabled thread that is marked as stealable.

    /* Try to steal from a victim. */
    while (1) {
        /* We try to steal this many times.  If we don't
         * succeed, then we yield. */
        UInt32 yieldCount = numThreads * 2;

        while (yieldCount--) {
            UInt32 victimId = randomVictimId( myId );
            perThreadInfo *victimState = allThreadState[ victimId ];

            /** BEGIN CRITICAL REGION **/
            if (tryQueueLock( &(victimState->localQueueLock) )) {
                pHptr prevElt = victimState->externalWork;
                pHptr currElt = asPtr(slot(prevElt, snext));
                pHptr stealCandidate = NULL;
                pHptr stealCandidatePrev = NULL;

                /* Code elided: traverse queue and find a steal candidate */

                if ( stealCandidate != NULL ) {
                    /* Stole some work */
                    slot(stealCandidatePrev, snext) = slot(stealCandidate, snext);
                    work = stealCandidate;
                    releaseQueueLock( &(victimState->localQueueLock) );
                    goto sched_return;
                } else {
                    /* Didn't find anything: try another victim. */
                    releaseQueueLock( &(victimState->localQueueLock) );
                }
            }
            /** END CRITICAL REGION **/
        }

        /* If we get to this point, then we have tried to steal
         * too many times.  We yield the processor. */
        yield();
    }

Figure 7-2: Work-Stealing Code in the Run-Time System

Once it finds a valid thread to steal, the code updates the queue, releases the lock, and jumps to the label sched_return. The code at sched_return (not shown) extracts the instruction pointer from the task structure, as well as the frame pointer and barrier register, and transfers control to the thread. If the thief finds no valid thread to steal, it releases the lock and begins the steal loop again, looking for another victim.

7.2.2 Exploiting Locality

Using local queues with work-stealing, we exploit locality in the program in several ways. First, most threads spawned by one SVP will be executed by that SVP. This means that storage used by those threads is likely to reside in the cache of that processor and does not have to be reloaded from main memory every time a thread is scheduled. Second, the use of a local queue allows us to implement a fast path through the scheduler to handle the most common case, where the next thread scheduled was the last thread spawned. In this fast path, less state has to be reloaded from the task structure into the SVP registers, since the next scheduled thread and the thread that just terminated have much in common. Finally, the use of local queues allows many queue operations to be executed without the need for locking. As shown in Figure 7-3, only a portion of the tail of the local queue is accessible via the externalWork variable. The front of the queue, where the SVP places and extracts its threads, is not accessible to thieves, and thus can be modified without paying the cost of acquiring the localQueueLock. Periodically, the SVP updates its externalWork variable to make more work available to thieves. Using this technique, it is possible to allow most accesses to the queue to proceed without incurring the cost of atomic memory instructions. Only steals and updates to the externalWork variable, both of which we expect to be infrequent, require locking the queue.

[Figure 7-3 artwork: the local work queue, with a pointer to the queue head for the local processor and a pointer to the queue tail for external processors (thieves); access to the tail region always requires mutual exclusion, access to the front region is reserved for the local processor (no mutual exclusion required), and a steal fence marks the point beyond which external processors cannot pass.]

Figure 7-3: Local and External Access to SVP Work Queues

7.3 pH Data Representation

The last abstraction in the SMT model is the synchronizing shared memory. This part of the model maps directly to the shared memory of an SMP. The real problem in the implementation is finding a good representation for pH data, the contents of memory. In simple languages like C, basic types like pointers, characters, integers, and floating-point numbers are mapped directly to the underlying types supported by the target processor. Languages like Haskell complicate matters by including polymorphism, so values of unknown types can be stored in data structures and passed to functions. pH presents the additional complication of synchronizing data structures. And to top things off, a complete pH implementation requires automatic memory management, since the language does not include provisions for explicit deallocation of data structures.

These factors have shaped the unique representation of pH scalars. To simplify the implementation of polymorphism, scalars have a fixed size. We use 64 bits because we want to represent IEEE double-precision floating-point numbers directly. The other scalar types include characters, fixed-precision integers, single-precision floating-point numbers, and pointers. Fixed-size scalars do carry the cost of increased memory usage, both in terms of footprint and bandwidth. We are not worried about footprint, but bandwidth may become a concern. To simplify the implementation of synchronizing memory and garbage collection, pH scalars include a multi-bit tag field. Tags are used to represent both presence information (full or empty) for I/M-Structures and type information (pointer or non-pointer) for copying garbage collection.

[Figure 7-4: Traditional Tag Encoding. A 64-bit tagged value holds a 60-bit value field and a 4-bit tag; tagging shrinks the representable integer range from [-2^63, 2^63 - 1] to [-2^59, 2^59 - 1].]
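The C definition of pHob is not reproduced in this chapter; as a purely hypothetical illustration, a fixed-size 64-bit scalar is naturally a union over the views the run-time system needs:

    #include <stdint.h>

    /* Hypothetical shape of a fixed-size pH scalar: one 8-byte word for
     * every value, whatever its type, which keeps polymorphic code and
     * data-structure slots uniform. */
    typedef union {
        double   asDouble;   /* IEEE double-precision, stored directly */
        int64_t  asInt;      /* fixed-precision integer view */
        uint64_t asBits;     /* raw view, used for tag tests */
    } pHob;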

7.3.1 An Efficient Encoding of Tagged Data

There are two basic ways to store tagged data in untagged memory. Tags can be stored in a separate region of memory, as in Shaw's implementation of Id [59], or tags can be combined with data and stored in the same word of memory. Early on, we discarded the idea of storing tags and data in separate memory regions due to the problems posed by the weakly-coherent memory systems in current microprocessors. Our code requires atomic access to tags and data in order to guarantee that these values are consistent when viewed by all processors. Weakly-coherent memory systems make it prohibitively expensive to implement atomicity across multiple words of memory. Thus, we settled on the second approach: the tag information for a memory slot is encoded within the contents of the slot.

Typical encodings of tagged data make use of some number of high bits or low bits of a data word for tag storage. A 64-bit implementation might use the low-order 4 bits and leave 60 bits available for data, as shown in Figure 7-4. We were not satisfied with this solution. The first problem is that the range of scalars is significantly curtailed. In the figure, we show the range of fixed-precision integers if tag bits are simply stolen from the low-order (or high-order) bits of a word. The range is only 6.25% of its original size. The second problem is that this tagging scheme requires untagging/retagging steps on every scalar operation. To guide the design of an alternative scheme, we came up with the following list of requirements:

1. Floating point numbers must be represented directly, not boxed.

2. The structure and precision of floating point numbers must not change.

3. The range of integers and pointers should be curtailed as little as possible.

4. Pointer/non-pointer checks should be fast.

5. Full/empty checks should be fast.

6. Untagging/retagging overhead for arithmetic operations should be kept to a minimum.

We placed great importance on the integrity of floating-point values because Id and pH have been used to implement scientific applications. This led us to look closely at the structure of IEEE-754 floating point numbers (double-precision), and we observed that not all patterns of 64 bits are valid floating point values. The IEEE standard specifies that positive and negative double precision numbers with an exponent field of all ones and a non-zero mantissa are Not-a-Numbers (NaNs). NaNs are invalid, but well-specified, values and are the result of certain computations such as arithmetic on infinities or NaNs. NaNs have several very interesting properties. First of all, there are lots of them, roughly 2^53 in the double precision representation. Second, when interpreted as twos-complement integers, the "positive" NaN bit patterns (+NaNs, for short), those with a sign-bit of zero, correspond exactly to the very largest 2^52 positive integer bit patterns. Third, most CPUs use only a handful of NaNs to signal errors.

Based on these observations, we created a data representation that uses NaNs to represent tagged pointers. As shown in Figure 7-5, the address field of a pointer in pH is 44 bits wide. The remaining bits are divided into two fields: a 7-bit tag and a fixed 13-bit NaN pattern. The large tag space exceeds our current needs and might be reduced in the future in order to increase the size of the address field. The handful of tags we have defined are described in the next section.

Figure 7-5: NaN Tag Encoding. (The diagram shows a NaN-tagged pointer split into a 13-bit NaN pattern, a 7-bit tag, and a 44-bit pointer, along with the largest positive integer representable in pH; tagging shrinks the integer range from [-2^63, 2^63 - 1] to [-2^63, 2^63 - 1 - 2^51].)

Encoding tagged pointers as NaNs turns out to be a surprisingly effective technique. On a point-by-point basis, it fulfills the requirements of our data representation quite well:

1. Floating point numbers are represented directly.

2. Because NaNs are outside the range of valid floating point numbers, the structure and precision of floats is untouched.

3. Integers are curtailed slightly to the range [-2^63, 2^63 - 1 - 2^51]. This restriction prevents the bit patterns of NaNs and integers from overlapping since the NaNs used to represent pointers use the same bit patterns as large integers. But the restriction is tiny. Over 99.98% of the integers in the original 64-bit range are still available. On the other hand, pointers are limited to 44 bits. This is a severe limitation when compared to the full 64-bit address space, but we believe it is acceptable for at least the next decade, especially in a language where memory is managed by the compiler/run-time system.

4. Pointer/non-pointer checks are fast. If a bit pattern, interpreted as a signed twos-complement integer, is greater than 2^63 - 1 - 2^51, then it is a pointer. Otherwise, it is a valid integer or floating-point number.

5. Full/empty checks are similarly fast, requiring a simple range check in the worst case. They are described in the next section.

BIT PATTERN (64 BITS, HEXADECIMAL)   DESCRIPTION

0x0000000000000000   Integer or floating point zero. We never have to tell these apart, so ambiguity is not a problem.

0x7FF0000000000000   The integer 9,218,868,437,227,405,312 or positive infinity in the IEEE floating point representation. Again, the ambiguity between these two values is not important.

0x7FF7FFFFFFFFFFFF   The integer 9,221,120,237,041,090,559, the largest positive integer that can be represented directly in our data encoding, or a signaling NaN.

0x7FF8000000001CF0   A pointer to a data structure at memory address 0x1CF0 tagged with the normal tag.

0x7FFFE00000000000   The unique empty value emptyval. No value is available in the cell.

0x7FFFE00000001CF0   A pointer to the head of a deferred list (a task structure) at memory address 0x1CF0. Task structures are allocated on the heap just like any other data structure, but when they form part of a deferred list, pointers to task structures are tagged with the empty tag.

Figure 7-6: Examples of Tagged Data


6. There is no untagging/retagging overhead for arithmetic operations because the representation of valid integers and floating-point values is unchanged. For safety, the compiler can insert overflow checks to make sure that a large positive integer does not overflow into pointer space or that a pointer is not incremented beyond its 44-bit range, but these checks are usually omitted.

Conversations with Prof. William Kahan of Berkeley and Dr. Cleve Moler of MathWorks, both members of the original IEEE-754 committee, confirmed that the committee crafted NaNs to allow encoding of useful information. Few language implementations seem to exploit this useful feature of the standard.
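As a concrete check of these properties, the following small stand-alone program (our illustration, with hypothetical names; not pHc code) builds the normal-tagged pointer of Figure 7-6, verifies that the signed-integer range test classifies it as a pointer, and verifies that the same bit pattern, viewed as a double, is a NaN:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NAN_BASE     0x7FF8000000000000LL  /* 13-bit NaN pattern, tag bits 0 */
#define MAX_INT_BITS 0x7FF7FFFFFFFFFFFFLL  /* 2^63 - 1 - 2^51                */

/* Assemble a tagged pointer: NaN pattern | 7-bit tag | 44-bit address. */
static int64_t makePointer(int64_t tag, int64_t addr) {
    return NAN_BASE | (tag << 44) | addr;
}

/* A word is a pointer iff, read as a signed twos-complement integer,
 * it exceeds the largest valid integer bit pattern. */
static int isPointerBits(int64_t bits) { return bits > MAX_INT_BITS; }

int main(void) {
    int64_t p = makePointer(0x00, 0x1CF0); /* 0x7FF8000000001CF0 (Fig. 7-6) */
    double d;
    memcpy(&d, &p, sizeof d);              /* reinterpret the bit pattern   */
    printf("pointer? %d  NaN? %d\n", isPointerBits(p), d != d);
    return 0;
}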

7.3.2 Tags for I/M-Structures

We have defined two tags for pointer values that might be stored in I-Structures or M-Structures. The first is the empty tag (value 0x7E). If the address field of a pointer with an empty tag is all zeros, then the bit-pattern corresponds to the unique empty value "emptyval". This value is used to initialize freshly allocated memory. If the address field is non-zero, then the bit-pattern represents a pointer to a list of deferred computations. This is the list of task structures representing the threads that have attempted to access the value of the memory location before it has been computed. The second tag is the normal tag (value 0x00). This tag is used for the results of normal allocation from the heap. It is used to mark pointers to standard, program-allocated data structures.

Figure 7-6 shows several examples of data tagged with empty and normal. It is interesting that while we have an explicit empty tag, we do not have a complementary full tag. Memory cells containing valid integers, floats, or pointers tagged with normal are implicitly full. The presence of valid data makes this status manifest. Otherwise, a cell would contain emptyval or a pointer tagged with empty to a deferred list.

Given this rationale, one might wonder why the normal tag is required at all. As long as pointers to regular data structures do not exceed 44 bits, they will not conflict with the emptyval or with pointers to deferred lists. Normal pointers could appear as implicitly full values in memory cells. The reason for the existence of the normal tag is garbage collection. In a copying collector, pointers are traced while non-pointers are simply copied during a traversal of the heap. The garbage collector must be able to distinguish pointers from non-pointers unambiguously, so as a consequence, all pointers are tagged.

7.4 Choices for Code Generation

So far in this chapter, we have tackled the first problem of generating executable code from SMT programs, namely the implementation of the SMT machine on real hardware. This section begins the discussion of the second part of the problem: the implementation of the SMT instruction set. We considered three solutions for translating SMT programs into programs that run on real machines: an interpreter for SMT instructions, a native code compiler, and a C-based compiler.

An SMT interpreter has the advantages of flexibility, portability, and simplicity. In this solution, the compiler emits a compact binary representation of the SMT instruction set which is then interpreted by the run-time system. The approach is flexible because SMT instructions can be implemented directly in the interpreter; it is portable because across architectures, the only part of the system that changes is the interpreter, not the compiler; and it is simple because most of the code is written within the clean abstractions of the SMT machine. The acceptance of Java and Visual Basic in the past three years has lent widespread credibility to interpreted languages. The problem with interpreters, of course, is their poor performance when compared to true compiled code, and it was for this reason that this option was not pursued in the development of pHc.

Compilers that generate native code (or machine code) are at the other end of the performance spectrum from interpreters. By emitting native code, the compiler has fine control over the execution of the program and can generate the fastest code possible. In a multithreaded language like pH, this level of control is often useful since control-transfers and thread context-switches must be handled with great care to avoid overhead. But because they must manage so many details in code generation, native code compilers are difficult and time-consuming to build. It also takes much longer to re-target a good native code-generator to a new architecture. Given the limited resources at our disposal in the pHc project, we concluded that we could not generate native code.

The option we chose in our code-generator was to emit C code for SMT programs. C is simple and has almost "transparent" semantics, making it easy to predict the performance of programs written in the language. As a compiler target, it serves the role of a high-level assembly language. Good C compilers exist for just about any platform, so it offers good possibilities for re-targeting pHc to different architectures. Generating C code is also a fairly straightforward process when compared to generating assembly language. A C-based code-generator for pHc can pass a lot of tedious work, like register allocation and instruction scheduling, onto the C compiler. The major problem with C is that it was not designed as a multithreaded language. Therefore, it includes no built-in notion of threading, lightweight contexts, or scheduling. All this support has to be built using libraries like POSIX threads or from scratch. In the next section, for example, we tackle a critical problem in executing C code in a multithreaded environment: avoiding the allocation of stack frames on procedure calls.

C has been used as a target in compilers for a variety of complex languages including Haskell [49], ML [67], Scheme [11], and Mercury [24]. In all these efforts, the implementors found that a C code-generator was easier to develop and port compared to a native code-generator. They also found that code performance was significantly worse than native code if they limited themselves to generating pure ANSI C, but that it improved dramatically if they used extensions available in some compilers like the GNU C Compiler (gcc). Indeed, this approach to code generation has become so popular that Peyton Jones et al. [51] have proposed C--, a portable assembly language loosely based on C syntax that corrects many of the deficiencies of C in this role.

The code generated by pHc relies on many of the features provided by gcc. We make particularly heavy use of the ability to take the address of a label and store it in a data structure and of the ability to perform indirect jumps via computed-goto statements. Both of these features are used to implement thread scheduling and suspension. We also rely on the seamless support for 64-bit values provided by the type long long in gcc, and on the ability to place specific global values in registers. We have few regrets in choosing to specialize our code to gcc. The compiler is widely available, free, and it consistently produces high-quality scalar code for just about any architecture.
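For readers unfamiliar with these two extensions, the following tiny stand-alone program (our illustration, not code emitted by pHc) shows a label address being stored in a data object with && and control being transferred through it with a computed goto:

#include <stdio.h>

int main(void) {
    void *resume = &&after;     /* gcc: take the address of a label     */
    printf("before the jump\n");
    goto *resume;               /* gcc: computed goto through a pointer */
    printf("never printed\n");
after:
    printf("after the jump\n");
    return 0;
}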

7.5 Avoiding Stack Allocation

Activation frames for pH procedures cannot be allocated on a stack because the lifetime of a callee's frame can exceed that of its caller. In our implementation, frames are allocated on the heap and garbage-collected just like any other data structure. The implication for pH procedures is that, though they are emitted as C procedures, they cannot be called using the standard C procedure calling mechanism because the standard entry point to a C procedure includes the code to allocate a frame on the stack.

Our solution to this problem is to bypass the standard procedure entry point by adding a new one, special to pH. We illustrate it in Figure 7-7 with fib, our usual Fibonacci example. First, using the gcc inline assembler facility, we insert a label in the C code for fib as the first statement in the procedure. Given the way code is generated by gcc, this label will appear immediately after the stack allocation code normally emitted at the beginning of every procedure. This label is the pH entry point of fib. The name of this label is constructed using a standard convention so that references in other modules to this entry point will be generated correctly.

Void Main_fib_PROCSHELL (Void) {
    __asm__(".global Main_fib_entry_point");
    __asm__("Main_fib_entry_point:");

(a) pH Entry Point in Definition of fib

extern Void Main_fib_entry_point (Void) __asm__("Main_fib_entry_point");

(b) fib Entry Point Declaration in Calling Module

goto *((Void*) Main_fib_entry_point);

(c) Invocation of fib in Calling Module

Figure 7-7: Alternative Entry Points to pH Procedures

Second, in any module that calls fib, we generate a declaration for the label we created. In this way, we introduce a valid identifier for the pH entry point of fib into the C compiler's namespace, and we use the __asm__ extension to assign a specific name to this identifier. This declaration is essential since our system supports separate compilation. Without it, the C compiler could not emit a proper reference to fib in a calling module.

Finally, a call to fib is coded using a gcc computed-goto. The statement in Figure 7-7(c) tells gcc to generate code to jump directly to the address represented by the symbol Main_fib_entry_point, which was declared earlier in the module. At link time, this symbol will be resolved to the address of the entry point we defined in the C code for fib.

This three-part technique allows us to transfer control to a pH procedure without allocating any space on the stack, just as we wanted. One last detail completes the scheme. Even though we circumvent the stack frame allocation as we transfer control from caller to callee, the code generated by the C compiler naturally expects a stack frame to be available for local variables, register spills, and scratch space. We address this problem by allocating a single large stack frame for each SVP. This is the purpose of the routine rts_run_thread referenced in Figure 7-1(b). Thus, all code executing on a particular SVP shares the same C stack frame. This might seem dangerous, but we are careful to save data that lives across suspension points to the heap (in frames or task structures). In addition, only one thread from one pH procedure can access the stack of a particular SVP at a time.

We only encountered one unexpected problem in the implementation of our no-stack calling convention: support for shared libraries. Shared libraries are used extensively in the compiler to implement the pH Prelude and the run-time system. Shared libraries make for much faster linking and loading of programs at the price of slightly slower procedure calls and global symbol access. In Solaris, shared libraries require the use of Position Independent Code (PIC), so gcc inserts extra instructions to support PIC along with the stack allocation code at the beginning of every procedure. Since our scheme for pH procedure calls bypasses this code, we must insert code explicitly into our program to support PIC. We were able to add this support using a few macros.

We have had excellent results using our calling convention. However, there is some question about the efficiency of heap allocation and garbage collection of activation frames. On one side, Appel et al. [3, 5] argue that it is a great idea because, among other things, as memories get larger, the overhead of garbage collection decreases. Others like Miller and Rozas [39] disagree. We chose heap-allocated frames in our compiler not for reasons of performance but because this strategy makes our generated code and the run-time system very clean. We also get the additional benefit that we can avoid the problem of tracking procedure termination.

An activation frame can be deallocated only when all computation using the frame has terminated, but detecting whether a procedure has finished executing is not easy in a multithreaded, non-strict language like pH. In compilers for Id, termination detection was implemented via the signal mechanism explained in Section 3.5.1. In our implementation of pH, we found it natural to let the garbage collector take care of this problem. If no references to an activation frame are found during garbage collection, then the corresponding procedure instance must have terminated and the frame can be deallocated. Turbak [73] uses a similar approach to compute the rendezvous (synchronization) condition of synchrons, first-class barrier objects.

In contrast to pHc, activation frames in pHluid are allocated in a special area called the store. This region is managed explicitly by the run-time system and is separate from the garbage-collected heap. Frame deallocation is triggered by a termination signal in the code.

Goldstein et al. [21] also discuss explicit frame management in the Berkeley Id compiler using stacklets to implement a cactus stack. A cactus stack consists of a group of small stacks linked into a tree structure related to the call tree of a program. The advantage of a cactus stack is that, when parallel function calls can be "inlined" on a single processor, frame allocation can be turned into stack allocation. When execution diverges from the LIFO stack model, a new stacklet is created and grafted onto the tree, and stack allocation can continue. In the common case, when a procedure returns, its frame on the stacklet is simply popped off. Of course, this technique also presumes a mechanism for detecting procedure termination.

7.6 Implementing SMT Instructions

The previous sections established the framework for the executable code produced by pHc. The abstractions of the SMT machine were given concrete implementations; pH scalars were designed to include useful tag information; and C with gcc extensions was chosen as the form for emitted code. This section shows how SMT instructions are translated into C code. This code is then passed through gcc and linked with the pH run-time system to produce executable programs for Sun UltraSPARC SMPs.

The descriptions in this section follow a format similar to that of Chapter 4. Effectively, we implement the RTL description of the SMT instruction in C code. A typical instruction is given followed by its compilation to C code.

Instruction

C Code for Instruction

7.6.1 The pHob Type

The C code generated by the compiler relies on the implementation of pH objects given by the pHob type in Figure 7-8. The pHob type is a C union consisting of four slots: a signed 64-bit integer slot, a double-precision floating point number slot, a "pure bits" 64-bit slot divided into two 32-bit words, and a pointer slot. In a union, these four components occupy the same 64 bits of storage, but this storage can be treated as if it had four different types.

The pointer slot merits some additional discussion. It is divided into two 32-bit fields, hiPtr and loPtr. This seems odd given that the definition of NaN-tagged pointers in Section 7.3 specified 44 bits of address. However, our compiler target is the UltraSPARC processor, and our version of Solaris (2.6) only uses 32-bit addresses. Therefore, we can dispense with the extra 12 bits available for pointers, and we can treat the low-order 32 bits as the full address. The upper 32 bits hold the tag information for the pointer and zeros for the 12 unused address bits.

typedef union phobjunionname {
    Int64 intSlot;
    Double floatSlot;
    struct phobjwordslot {
        UInt32 hiWord;
        UInt32 loWord;
    } longwordSlot;
    struct phobjptrslot {
        Int32 hiPtr;
        union phobjunionname *loPtr;
    } ptrSlot;
} pHob;

typedef pHob *pHptr;

#define asInt(x)    ((x).intSlot)
#define asFloat(x)  ...
#define asPtr(x)    ...
#define isEmpty(x)  ...

Figure 7-8: pHob: The Implementation of pH Scalars

Along with the type definition of pHob, there are a number of utility macros used by the code. For example, the asInt macro "converts" a pHob into a 64-bit integer in C by extracting the intSlot field of the union. Most of these macros do not generate any code. Rather, they exist to uphold the type discipline of C. Other macros, like isEmpty, are used to determine if a memory location is empty or holds a deferred list. These macros do generate a small amount of code to check the tag field of the value read. All of the macros are explained as they appear in the following sections.
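The figure elides the bodies of the remaining macros. Plausible definitions, inferred from the data representation of Section 7.3 and the union above, might read as follows; these are our guesses, not the actual pHc source:

#define asFloat(x)  ((x).floatSlot)
#define asPtr(x)    ((x).ptrSlot.loPtr)   /* low 32 bits hold the address */

/* The top 20 bits (NaN pattern plus tag) equal 0x7FFFE exactly when the
 * tag is empty (0x7E), i.e., the cell holds emptyval or a deferred list. */
#define isEmpty(x)  ((((x).intSlot) & 0xFFFFF00000000000LL) == \
                     0x7FFFE00000000000LL)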

7.6.2 Arithmetic and Move: plus, etc., and move

An arithmetic or move instruction is emitted as a single line of C code. Here is an example of a plus instruction (integer addition) in SMT and in C:

plus(rV, r1, r2)

RV = (pHob) ( asInt(R1) + asInt(R2) );

The SMT registers RV, R1, and R2 are shortcuts to access the virtual register state of the current SVP (Figure 7-1). These are defined as:

#define RV (localState->virtualregisters)[0]
#define R1 (localState->virtualregisters)[1]
#define R2 (localState->virtualregisters)[2]

The type of RV, R1, and R2 is pHob, so the macro asInt enables these to be treated as integers for the addition. The cast back into pHob restores the correct type of the pHob union.

The move instruction also generates very simple code, except this time, no casting is necessary. The value of one pHob is transferred to another:

move(r1, r2)

R1 = R2;

7.6.3 Control Flow: jump and switch

The jump and switch instructions are used to manage control flow in SMT. The jump instruction takes an optional condition, a target address, and an optional return link register. In the example, we show an instance of jump with all optional components:

jump((r1 < r2), r3, r4)

if ( asInt(R1) < asInt(R2) ) {
    obPtrNoTag(R4, &&IL_15);
    goto *((Void *) asPtr(R3));
}
IL_15:

The conditional jump is translated into a C if statement that compares R1 and R2 as integers. If the condition is false, control passes to the next statement after the if. When the comparison is true, the code proceeds to generate the return link address. To do this, the compiler creates a unique internal label IL_15 that marks the code immediately after the jump. Using the gcc extension &&, it computes the address of this label and stores it in R4 with the obPtrNoTag macro. The effect of this macro is to create a ptrSlot object of the pHob union in R4 by assigning the proper values to the hiPtr and loPtr fields. The loPtr gets the address of IL_15 and the hiPtr gets zero. Pointers to code are treated as integers (untagged) because they are not to be traced by the garbage collector. The obPtrNoTag macro generates the following C code when it is expanded fully:

R4 = (pHob)((struct phobjptrslot) {0, &&IL_15});

Next, the code jumps to the target address contained in register R3. The effect of the asPtr macro is to remove the NaN pattern and the tag from a pHob so that only the address information is left.[1] Unlike the asInt macro, this conversion does involve a tiny amount of computation. Then, using the computed-goto extension supported by gcc, the code transfers control to that address. In standard C programs, the target of a goto statement can only be a static label identifier.

The code for the switch instruction is even simpler than jump since it maps well to the C switch statement:

switch(r1, SMT_L_32, SMT_L_35, SMT_L_51)

switch ( asInt(R1) ) {
    case 0: goto SMT_L_32; break;
    case 1: goto SMT_L_35; break;
    case 2: goto SMT_L_51; break;
    default: SWITCHERROR( __FILE__, __LINE__ ); exit(1);
}

The compiler always adds a default case to handle the error where the discriminant of the switch does not correspond to any alternative. This type of error is due to a problem with the compiled code or with a corruption of the heap. It does not reflect an error in the original source program. In our prototype implementation, these sorts of checks cost little and help catch errors that might otherwise be detected much later in the code.

7.6.4 Scheduling: spawn and schedule

The spawn instruction is coded as a procedure call into the run-time system. This approach centralizes and simplifies the debugging of this subtle piece of code. The spawn_RTS procedure takes care of adding the given task structure to the front of the local scheduling queue. The implementation of the spawn instruction in C is:

[1] Actually, the asPtr is not needed here since R3 should contain an untagged code pointer. For the present, we keep it in the interest of safety.

spawn(r1)

spawn_RTS( R1 );

The schedule instruction terminates the current thread and extracts a new one from the work queue. We code it, using the gcc computed-goto extension, as a branch into the scheduler code in the run-time system. We use a branch instead of a procedure call because control never "returns" to the current thread. Rather, the run-time system will give control to a new thread.

schedule

goto *((Void *) schedRTS);

The schedRTS entry point for the scheduler is declared as

Void schedRTS(Void) __asm__("sched_RTS");

so that the C identifier schedRTS corresponds to the assembly language label sched_RTS.

That label is defined in the body of the scheduler using the gcc inline assembly extensions:

asm__(".global schedRTS");

__asm__("sched-RTS: ");

Thus, the SMT instruction schedule gets compiled into a single goto in C which causes a direct transfer of control from the program to the run-time system.

7.6.5 Memory Allocation

The allocate instruction requests memory from the memory manager in the run-time system. The code generated is slightly complicated because it must handle the case when no memory is available. In this case, the code must save any live register state to the scratch memory of the SVP and then call the run-time system again, possibly triggering a garbage collection.

Instead of using the entire SVP register file as the root set, only that part stored in scratch memory becomes part of the root set. This strategy helps lower the amount of useless storage retained across garbage collections since it excludes any dead pointers that have not been cleared out of the register file. It agrees well with the findings of Appel and Shao [5] on compilation techniques that are "safe for space complexity."

The instruction below allocates a block of memory whose size is in r2 and places a pointer to the block in r1.

allocate(r1, r2)

{
    pHptr mem;
    const Int64 allocsize = asInt( R2 );
    reserve(mem, allocsize);
    if (mem == NULL)
        mem = reserveFail( localState, allocsize );
    obPtr( R1, mem );
}

The reserve function is the main interface to the memory manager in the run-time system. Normally, a call to reserve returns a pointer to a memory block of the requested size. However, when no memory is available, the call returns NULL. In this case, the code calls reserveFail to yield control to the memory manager. The code for reserveFail synchronizes all SVPs, performs a garbage collection, and returns the requested memory blocks to its callers. The invariant for reserveFail is that it will never return NULL.

If memory is not available after a garbage collection, then the program has exceeded its maximum heap limits, and it will stop with an error. We provide more details about the implementation of the memory manager in Section 7.7.

7.6.6 Simple Loads and Stores

The load and store instructions are straightforward to implement because they do not pay attention to the presence field of a memory word, i.e., they are not used for synchronizing different threads. The following is a typical load instruction:

load(r1, rF[5])

R1 = asPtr( RF )[5];

A store instruction is compiled as follows:

store(rF[5], r1)

asPtr( RF )[5] = R1;

The store instruction is supposed to set the presence field of the memory cell to full. Given the encoding of tags in our data representation, this operation is done by the code above if the value in R1 is a valid integer, float, or pointer tagged with normal. These values implicitly make the state of the memory cell full. The code generator guarantees that values stored with the store instruction fall in one of these categories.

7.6.7 Synchronizing Memory: touch, istore, and take

The touch instruction is the basic synchronizing instruction in the SMT instruction set. Because touch is suspensive, it might include a live-state annotation as in the following example:

touch(r1) {r2,r3}

{
    const pHptr p = asPtr( R1 );
    if ( !isFull( *p ) ) {
        obPtrNoTag(asPtr(RT)[2], &&IL_27);
        asPtr(RT)[3] = R2;
        asPtr(RT)[4] = R3;
        R1 = p;
        goto *((Void*) suspendRTS);
    IL_27:
        R2 = asPtr( RT )[3];
        R3 = asPtr( RT )[4];
    }
}

The code above computes the address of the memory cell it is going to touch and then tests its contents with the isFull predicate. The isFull predicate returns true if the given value does not have a tag of empty. In our encoding scheme, this means that the bit-pattern of the value, when interpreted as a signed integer, must be less than 0x7FFFE00000000000, the empty value. If the test is true, then the predicate of the conditional is false, and execution continues.

If the test is false and the predicate is true, however, then the thread must suspend because the value of the memory cell is not available. Before suspending, the thread saves its resume address (marked by the label IL_27) and all live state into its task structure in register RT. The offsets within the task structure are one greater than those shown in Section 4.4 due to a header word reserved by the garbage collector. Notice that instead of using a register mask to identify the saved registers, we explicitly encode them into the generated code. After saving state, the thread jumps directly to suspendRTS in the run-time system. The code for suspendRTS adds the task structure to the deferred list of the memory cell and schedules a new thread from the local work queue. The code at suspendRTS also takes care of preserving the RF and RB registers of the current thread.

When a value becomes available, the suspended code will resume at IL_27. By convention, when the code resumes, the address of the task structure is in RT and the values of RF and RB have been restored. The code makes certain to restore any additional live state before proceeding. Notice that after the resumption, the code does not use any of the local variables. This approach avoids problems that might occur if the code resumes on a different processor, with a different stack configuration, than the one on which it suspended.

The code for touch is substantial, but it is important to remember that the common case is that a value will be available when the instruction executes. In this case, the cost of the touch is essentially the fast isFull test and a conditional jump that is easy to predict for modern branch prediction hardware. Section 8.4 gives empirical evidence that indeed most touch instructions do not suspend.

The istore instruction is used to synchronize with computations waiting for values. Like store, the istore sets the state of a memory cell implicitly by storing a valid integer, float, or normal pointer. Unlike store, it also takes care of resuming any computations waiting on the value. The tricky part of an istore is that the code must access the memory cell atomically since the istore instruction might be running concurrently with any number of touch instructions.

istore(r1, r2)

{
    const Int64* storeAddr = (Int64*) asPtr( R1 );
    const Int64 swap = asInt( R2 );
    Int64 old = *storeAddr;
    Int64 compare;
    do {
        compare = old;
        old = compareAndSwap64(storeAddr, compare, swap);
    } while ( compare != old );
    if ( !isEmptyVal((pHob) old) )
        IMStoreComplexRTS(storeAddr, old, swap, 0);
}

The core of the istore code is the compareAndSwap64 routine. This routine is provided by the machine-dependent portion of the run-time system, and it is designed to be inlined into its callers. The routine behaves as follows: if the contents of the memory cell at address storeAddr equal the value of compare, then the value swap is stored in the memory cell. The operation is atomic because it guarantees that no other processor can access the memory cell between the comparison with compare and the store of swap. If the contents of storeAddr are not equal to compare, then the swap does not occur. In both cases, the contents of memory at storeAddr are returned as the result of the function call. This allows the caller to determine whether or not the swap occurred by comparing the return value with compare.

After the compareAndSwap64 succeeds, the code for istore checks the returned value with the isEmptyVal predicate. This predicate returns true only if its argument is emptyval (0x7FFFE00000000000). If the predicate is true, it is certain that no deferred list existed in the cell, so execution can proceed to the next SMT instruction. If the predicate is false, then a deferred list has to be processed. This case is not the common case, so we leave it up to the run-time system routine IMStoreComplexRTS to handle it. This routine takes care of splicing the deferred list onto the local work queue.

The most complex of the synchronizing memory instructions is take. A take of an M-Structure cell with a value causes the value to be fetched and the location to be emptied out, so that any future take defers. A take of an empty cell (or of a cell with a deferred list) suspends and waits on the deferred list. An istore to an M-Structure cell behaves as it does for an I-Structure cell, so it is the special behavior of take that assures the mutual-exclusion semantics of M-Structures. Here is an example of a take:

take(r1, r2) {r2}

{
IL_31:
    obPtrNoTag(TR1, R2);
    /* Test presence state */
    if ( !isFull( *asPtr(TR1) ) ) {
        obPtrNoTag( asPtr(RT)[2], &&IL_32 );
        asPtr(RT)[3] = R2;  /* Save live state. */
        /* suspendRTS expects: R1 = suspend addr, RT = task structure */
        R1 = TR1;
        goto *((Void*) suspendRTS);
    IL_32:
        R2 = asPtr(RT)[3];  /* Restore live state using RT. */
        /* Reset the M-cell pointer. */
        obPtrNoTag(TR1, R2);
    }
    /* Is cell EMPTY due to another thread? */
    TR2 = *asPtr(TR1);
    if ( !isFull(TR2) )
        goto IL_31;  /* Try again */
    TR1 = (pHob) compareAndSwap64((Int64*) asPtr(TR1), asInt(TR2), EMPTYVAL);
    if ( asInt(TR1) != asInt(TR2) )
        goto IL_31;
    /* If we reach this point, the swap has succeeded */
    R1 = TR1;
}

The code first stores away the pointer to the M-Structure cell in R2 into a temporary register TR1. Temporary registers are not visible to the SMT instruction set, but they are useful in generating code for SMT instructions. Next, the code tests the presence state of the M-Cell using the isFull predicate. If the cell is not full, then the take must suspend until a value is available. As in the case of touch, the resume address and the live state are saved in the task structure (pointer in RT) and then the code jumps to suspendRTS.

Control returns to the code when an istore is issued against the M-Cell. The code restores the live state and checks the status of the cell again. Why? The istore may have caused multiple deferred take instructions to resume, and one of them may have already grabbed the contents of the cell. If so, then this take must suspend itself again until the next istore.

However, if the cell is not empty, the code attempts to grab the value atomically using the compareAndSwap64 routine. The swap value is the emptyval (the C constant EMPTYVAL). If the swap fails, then some other instruction has beaten this take to the contents of the cell. The code tries to grab the contents again from the beginning (label IL_31). But if the swap succeeds, this take has been successful. The results of the swap (the old value of the cell) are placed in the destination register R1 and the program continues.

The code for take is complex because it comprises both a touch and a load. Moreover, the presence state of an M-Structure cell might be changing all the time from empty to full and back. In contrast, a plain touch is much simpler to implement because the presence state of an I-Structure cell can only transition from empty (or deferred) to full. The code for take has not been the focus of much optimization since M-Structure operations are fairly rare in pH code.

7.6.8 Barrier Instructions: initbar, incbar, decbar, and touchbar

The initbar instruction initializes the count and prev fields of a barrier structure. In properly structured code, this instruction is guaranteed to execute before any incbar, decbar, or touchbar on the barrier being initialized. Thus, there are no synchronization issues to consider. The instruction is compiled as follows:

initbar(r1, r2)

asPtr( R1 )[1] = asInt( R2 ) + 1;
asPtr( R1 )[2] = RB;
RB = R1;

Again, due to the header word for the garbage collector, offsets into the barrier object are one greater than otherwise expected. Another minor implementation detail is that the counter is initialized to one greater than the initialization count in the instruction. Correspondingly, the touchbar instruction (explained later) also decrements the counter when it executes. This change allows us to synchronize the last decbar with the touchbar using only the counter. The abstract behavior of the instruction in the SMT machine remains unchanged.

The incbar instruction increments the count field atomically. It uses the atomicInc64 primitive provided by the run-time system. Here is an example of the code generated for incbar:

incbar(rB)

atomicInc64( (Int64*) (asPtr( RB ) + 1), 1 );

The code for atomicInc64 is built on top of compareAndSwap64 as follows:

extern inline Int64 atomicInc64(volatile Int64* incAddr, Int64 incVal)
{
    Int64 oldVal;
    Int64 newVal = *incAddr;
    do {
        oldVal = newVal;
        newVal = oldVal + incVal;
        newVal = compareAndSwap64(incAddr, oldVal, newVal);
    } while (oldVal != newVal);
    return oldVal;
}

The decbar instruction decrements the count field. Additionally, a decbar will enable the post-region if it notices it is the last instruction to execute against the barrier. The implementation of decbar relies on the run-time system primitive atomicDec64 to decrement values atomically.
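The text does not show atomicDec64; presumably (our assumption) it is just the mirror image, reusing atomicInc64 with a negated increment:

extern inline Int64 atomicDec64(volatile Int64* decAddr, Int64 decVal)
{
    /* Decrement atomically and, like atomicInc64, return the old value. */
    return atomicInc64(decAddr, -decVal);
}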

decbar(rB)

{
    const Int64 oldCount = atomicDec64((Int64*) (asPtr( RB ) + 1), 1);
    if ( oldCount == 1 )
        /* Decbar is last. Touchbar already decremented counter. */
        spawn_RTS( *(asPtr( RB ) + 3) );
}

Remember that the subtle issue in the implementation of decbar is determining whether or not the last decbar executes before or after the single touchbar for a given barrier. In our implementation, we have solved this problem by initializing the count to a value one greater than that required and by making touchbar decrement the count. Thus, if the count is ever 1 when a decbar instruction executes, then it is guaranteed that the touchbar has executed and that the task structure for the post-region is available in the third field of the barrier structure. In this case, the code for decbar makes sure to spawn the post-region before continuing.

One of the effects of the partitioning analysis and the threading transformation is to decrease the number of incbar and decbar instructions emitted versus the naive compilation algorithm. This reduction is important in improving the performance of the code since both these instructions require atomic access to memory.

The last instruction for handling barriers is touchbar. In properly structured code, a touchbar is executed only once per barrier. If the counter has reached 1 when the touchbar accesses the barrier, then it is certain the last decbar has executed and the post-region is free to proceed. Otherwise, the post-region must suspend until the last decbar executes.

touchbar(r1)

if ( asInt( (asPtr(R1))[1] ) == 1LL )
    RB = (asPtr( R1 ))[2];
else {
    Int64 oldCounter;
    /* Set up resume point */
    obPtrNoTag( asPtr(RT)[3], &&IL_54 );
    /* Set post-region in barrier */
    (asPtr(R1))[3] = RT;
    oldCounter = atomicDec64( (Int64*) (asPtr( R1 ) + 1), 1LL );
    assert( oldCounter >= 1 );
    /* Check counter again, in case last 'decbar' has executed */
    if (oldCounter == 1)
        /* Restore previous barrier */
        RB = (asPtr( R1 ))[2];
    else
        goto *((Void*) schedRTS);
IL_54:
    ;
}

The tricky part of touchbar is that the code must be very careful that the last decbar does not "sneak by" while the touchbar is getting ready to suspend. This explains two features of the implementation above. First, the counter is decremented after the task structure for the post-region is written into the barrier. If done in the reverse order, the last decbar might see a counter value of 1, but garbage in the post-region slot. Second, the code checks the value of the counter again after decrementing, just in case a decbar has executed between the time of the first check and the time of the decrement.

If the counter has reached 1, whether before or after the touchbar decrements it, the previous barrier is restored to the barrier register RB and execution continues in the post-region. If the counter has not reached 1, then there are outstanding threads, and the code suspends by writing itself into the barrier and by jumping directly to schedRTS in the run-time system.

The implementation of the touchbar instruction is obviously complex since it is synchronizing multiple threads. Fortunately, touchbar instructions occur only in code that uses the seq construct and nowhere else. Sequential code is fairly rare in pH, and in situations that call for the use of barriers, the programmer should be prepared to pay extra cost for such extensive synchronization.

7.6.9 Summary

In this section we have shown how SMT instructions are converted into C code with GNU extensions. While some instructions compile into simple statements, others like touch and istore require significantly more code. These instructions are fairly common, and they tend to make the C programs generated by our compiler rather large. However, we believe the impact on run-time is less marked. We have structured the code for instructions like touch so that the common case runs quickly, and indeed, the bulk of the code is devoted to handling the uncommon case, i.e., suspension.

The dynamic performance of programs is affected by several choices in our code generation strategy. First, the use of the GNU extensions for getting the address of a label (&&) and for computed gotos, while extremely useful for establishing thread contexts, tends to trip up the optimizer because it increases the number of branch targets in a procedure. Second, we have chosen to use run-time system functions for common code, such as frame initialization. For debugging and instrumentation, it is preferable to have a centralized routine for common code, but for performance, it might be better to inline this code everywhere. Of course, this inlining comes at the cost of much larger programs.

7.7 Garbage Collection

Because garbage collection tends to affect many different parts of a compiler and run-time system, we approached the problem with great apprehension, as have other pH implementors before us. Though the design of the language effectively dictates automatic memory management, only one earlier system, pHluid, actually supported it. Previous implementations at MIT and Berkeley made do with ugly, explicit deallocation annotations (@free). But once the structure of the data representation was settled, developing the garbage collector was actually a very pleasant experience. It is important to note that most of the garbage collector is the work of Jan-Willem Maessen.

The collector is based on the standard two-space, copying design but includes a few modifications. The to and from spaces are not mapped as linear spaces in memory, but rather are divided into chunks of several virtual memory pages. Initially, an SVP is assigned its own chunk when it is initialized. The SVP allocates from this chunk and requests a new one only when it exhausts the current one. A round of garbage collection occurs when no free chunks remain. In this way the available heap is given to the SVPs that are allocating most quickly. Memory can be allocated in parallel, as each SVP manages its own allocation chunk. However, all SVPs must synchronize (via reserveFail) to perform the garbage collection which is currently implemented as a sequential algorithm. The processor that ran out of memory first is assigned the task of tracing the heap and restarting the others when it is done. We considered building a fully parallel collector, but concluded that the complexity of the effort would gravely affect the project schedule.

The memory allocation and garbage collection algorithms account for the remainder of the fields in the perThreadInfo data structure of Figure 7-1. The chunkBot, chunkHP, toSpace, fromSpace, and toUsed fields are used to keep track of the chunk assigned to each SVP. The fields done_by_me, threadNum, and thread_data are used when all SVPs synchronize before a garbage collection. They are also used to keep track of which SVP is responsible for doing the garbage collection and restarting the others.
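A minimal sketch of the per-SVP fast allocation path, assuming the chunkHP and chunkBot fields behave as their names suggest; the struct layout shown, the function name, and the growth direction are our assumptions, not pHc's actual code:

typedef struct perThreadInfo {   /* only the two fields this sketch needs */
    char *chunkHP;               /* next free byte in this SVP's chunk    */
    char *chunkBot;              /* end of the current chunk              */
} perThreadInfo;

/* Bump allocation out of the SVP's private chunk. No lock is needed
 * because each SVP allocates only from its own chunk. */
static void *tryReserve(perThreadInfo *svp, long nbytes)
{
    char *mem = svp->chunkHP;
    if (mem + nbytes > svp->chunkBot)
        return 0;                /* chunk exhausted: get a new chunk, or  */
    svp->chunkHP = mem + nbytes; /* synchronize for a garbage collection  */
    return mem;
}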

The garbage collector defines four additional pointer tags beyond empty and normal. First, the reference tag is used to mark pointers that serve as references to values. When the GC traverses the heap, it uses this tag to determine that references point to single-cell memory blocks. Second, the malloc tag is used for pointers to objects that are not movable. The memory allocator maintains an area of memory that it manages using the operating system malloc and free routines. Malloc'ed objects include arrays and other large objects. Third, the global tag is used to mark top-level objects such as closures for top-level functions, i.e., objects that are specified statically at compile time. These are handled specially because their space is allocated by the operating system (more properly, the program loader), not by the pH memory manager. The global tag can be combined with any other tag (e.g., to form global-reference). Finally, the forwarding tag is used only during garbage collection, not during normal program execution. It is used to mark objects that have already been copied to fresh memory.

The code generated by the compiler as well as the code in the RTS must be designed to operate within a garbage-collected environment. For example, the code must save all live register state to scratch memory whenever a call to reserve fails. The garbage collection algorithm uses this state as part of its root set for tracing memory. Without a proper root set, the garbage collector might not deallocate garbage, or even worse, might discard live data.

Another example occurs in the work-stealing loop in Figure 7-2. The code as shown in the figure is susceptible to a subtle livelock condition. Suppose one SVP tries to steal work but no work is available because all other SVPs are waiting for a garbage collection. The work-stealing SVP requires no heap allocation, so it will never call reserveFail to join with the others in the garbage collection. Instead, it will try to steal work from the others forever. To solve the problem, we check the global GCFlag before every call to yield, as follows:

if (GCFlag) {
    pHptr garbage;
    /* Clear out state sitting in our registers to avoid leaks */
    obInt(RF, 0LL);
    obInt(RV, 0LL);
    obInt(RB, 0LL);
    obInt(RSS, 0LL);
    garbage = reserveFail(regs, 0);  /* Check only! */
} else
    yield();

/* Machine Instructions: Sparc V9 64-bit loads and stores */
#define STX(r,a) { __asm__ volatile ("stx %0,[%1]" \
                     : /* no outputs */ : "r" (r), "r" (a) : "memory" ); }

#define LDX(a,r) { __asm__ ("ldx [%1],%0" : "=r" (r) : "r" (a)); }

/* Sparc V9 64-bit compare and swap. */
#define CASX(a,c,s) { __asm__ volatile ("casx [%2],%3,%0" \
                     : "=r" (s) : "0" (s), "r" (a), "r" (c) : "memory" ); }

extern inline Int64
compareAndSwap64(volatile Int64* swapAddr, Int64 compareVal, Int64 swapVal)
{
    /* Declare temporary, caller-save 64-bit global registers. Only the
     * global (gxx) and out (oxx) registers are saved as 64-bit quantities
     * in Solaris 2.6. */
    register Int32 compareTemp __asm__ ("g1");
    register Int32 swapTemp    __asm__ ("g2");

    LDX(&compareVal, compareTemp);
    LDX(&swapVal, swapTemp);
    CASX(swapAddr, compareTemp, swapTemp);

    /* We store the swapTemp value back into a UInt64 variable to
     * make sure we can pass it back properly. UInt64 values are not
     * stored in a single register on 32-bit Solaris. */
    STX(swapTemp, &swapVal);

    return swapVal;
}

Figure 7-9: UltraSPARC Definition of compareAndSwap64

7.8 Machine Dependences and Porting

Though we have focused on Sun UltraSPARC SMPs during the development of the pH code generator, we spent a substantial portion of our time making sure that the code was portable to other systems. Indeed, one of our major design decisions, that of using GNU C as our output format, was due in large part to portability concerns. In this section, we describe the machine-dependent definitions in the run-time system and provide some of the UltraSPARC versions as examples.

All machine dependences in the run-time system are segregated into a single file. This file provides two sets of definitions to the rest of the code.

The first is a set of basic type definitions including signed and unsigned versions of characters (Char), 8-bit integers (Int8), 16-bit integers (Int16), 32-bit integers (Int32), 64-bit integers (Int64), IEEE single precision floating-point numbers (Float), IEEE doubles (Double), void (Void), and booleans (Bool). The C standard does not provide for stable type definitions, so portable code must define its own types. The second set of definitions concerns basic atomic operations. These include 32- and 64-bit versions of compare-and-swap, compareAndSwap[32,64], as well as 32- and 64-bit versions of atomic increment and decrement, atomicInc[32,64] and atomicDec[32,64].
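As an illustration, a fragment of such a definitions file for the 32-bit SPARC/Solaris target might read as follows (our reconstruction; the actual file may differ):

typedef char               Char;
typedef signed char        Int8;    typedef unsigned char      UInt8;
typedef short              Int16;   typedef unsigned short     UInt16;
typedef int                Int32;   typedef unsigned int       UInt32;
typedef long long          Int64;   typedef unsigned long long UInt64;
typedef float              Float;   /* IEEE single precision */
typedef double             Double;  /* IEEE double precision */
typedef void               Void;
typedef int                Bool;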

In the UltraSPARC implementation, these last four routines are built on top of the 32- and 64-bit compare-and-swap primitive as we discussed in Section 7.6.8. The definition of compareAndSwap64 is given in Figure 7-9. The implementation of this primitive on the UltraSPARC is slightly complicated because it makes use of the 64-bit processor registers. Unfortunately, only very recently has a fully 64-bit version of Solaris become available, and GNU compiler support for it is not ready. The result is that in the 32-bit Solaris 2.6 used for developing the compiler, only a few of the 64-bit processor registers are preserved across context switches. These include the "global" registers (g1-g7) and the "out" registers (o0-o7). In compareAndSwap64, we use two of the global registers as operands for the native UltraSPARC 64-bit compare-and-swap instruction casx.

The compareAndSwap64 routine is the most important machine-dependent routine in our compiler. It requires that the underlying processor support some sort of 64-bit atomic memory operation. Today, processors from Sun, Intel, MIPS/SGI, Digital, and IBM provide this support. Beyond the definitions of the machine-dependent types and routines, there are no other changes required to port the system. Once the RTS has been ported, code emitted by the compiler can be recompiled by gcc for the new architecture.
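On a processor without a 64-bit atomic operation, a port could preserve the compareAndSwap64 interface by falling back on a lock, at some cost in performance. The sketch below is entirely our illustration, not part of the pHc RTS:

#include <pthread.h>

typedef long long Int64;   /* as in the machine-dependent definitions */

static pthread_mutex_t casLock = PTHREAD_MUTEX_INITIALIZER;

Int64 compareAndSwap64(volatile Int64* swapAddr, Int64 compareVal, Int64 swapVal)
{
    Int64 old;
    pthread_mutex_lock(&casLock);
    old = *swapAddr;           /* read, compare, and conditionally write */
    if (old == compareVal)     /* under the lock, so the three steps     */
        *swapAddr = swapVal;   /* appear atomic to other threads         */
    pthread_mutex_unlock(&casLock);
    return old;                /* always return the old contents */
}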

7.9 Summary

This chapter has presented the strategy used in the pH compiler to generate executable code. We divided the problem into two parts. The first part concerned the implementation of the SMT machine abstractions on a real machine. These abstractions include the processors, the scheduling queue, and the synchronizing memory of the machine. Their concrete implementation comprises the bulk of the run-time system of the pH compiler.

Each SMT processor is implemented using a kernel-scheduled POSIX thread and a small C routine that schedules an initial thread. After that, the SMT code is self-scheduling. All the SMT virtual processors (SVPs) execute within the same operating system process, so they share a single addressing space. The global work-queue of the SMT machine is implemented using a set of distributed work queues, one per SVP, to take advantage of locality. A processor spawns new threads onto its local queue and schedules threads out of its local queue. When the local queue is empty, the processor attempts to steal work from another randomly chosen processor. This type of scheduling has exhibited excellent properties in systems like Cilk, and though these performance results are not generally applicable to a non-strict language like pH, they will tend to hold in the common case of pH program execution. To implement the SMT synchronizing memory, we designed a novel representation for pH data. This encoding is based on IEEE floating-point Not-a-Numbers and provides a seven-bit tag field for encoding presence information or GC types directly in each pH scalar. The range of integer scalars is reduced by about 0.02% while that of floating point numbers is unchanged.

The second part of the problem concerned code generation for SMT instructions. We chose to emit C code, rather than byte-code or assembly language, for two main reasons. By emitting C, we can achieve good performance without having to write many optimizations (like register allocation and instruction scheduling) since these are already part of the C compiler. Furthermore, a C back-end is much easier to port from one architecture to another. By taking advantage of extensions in the GNU C compiler, we have been able to generate lightweight threads in a way that is fairly generic across gcc versions. We showed how to generate C statements for each SMT instruction. Some instructions compiled to a single statement, while others, especially those involving synchronization, require substantially more code. In all cases, however, there is a single "fast-path" through the code in the common case. This path is the one we expect will be executed most of the time.

Chapter 8

Preliminary Evaluation

This chapter presents a preliminary evaluation of the code-generator we have designed and implemented. Throughout this dissertation, we focused on correctness and completeness first and on performance second. The experimental evidence bears this out. The principal goals of this evaluation effort were to develop the tools to measure the behavior of the code and to help us understand where performance might be improved. A detailed performance study is left for future work.

8.1 Experimental Setup

8.1.1 Programs

We selected five programs for the evaluation. The programs were chosen because they demonstrate a variety of programming strategies and because they are known to be correct. To lessen the effect of startup transients in paging and caching, the input parameter to each program was tuned to yield a single-processor execution time of about 30 seconds. It is also important to note that bounds-checking for arrays was disabled. Finally, for each program, the same executable was run on all machine configurations. We did not compile special "uniprocessor" versions. The programs we chose are:

Fib: a doubly-recursive version of the Fibonacci function. This program consists of only 15 lines of pH code, but it is a good test of scalability and procedure calling conventions. Fib was executed with the argument 32.

Queens: determines the number of ways in which n queens can be positioned safely on a chess board of size n x n. Like Fib, Queens is a simple program (only 27 lines of code), but it is a good test of procedure calls and lists. Queens was executed with an argument of 9, so it computed the results for values of n from 1 to 9.

Paraffins: a well-known benchmark in functional languages, Paraffins enumerates the distinct isomers of each paraffin of size up to n. (A paraffin is a molecule with the chemical formula CnH2n+2, where n > 0. Isomers are compounds with identical atomic compositions but differing geometric structures.) The number of paraffins grows exponentially with n, and in our tests, the parameter n was set to 20. Paraffins consists of 185 lines of pH code.

MMT: a blocked version of n x n matrix multiply. The elements of the matrix are 64-bit integers. The innermost loop of the matrix multiply algorithm reads 4 elements from each input matrix and computes 16 partial results from these 8 elements; this is a well-known strategy to reduce the number of memory accesses in matrix multiply (a sketch appears after this list). For our tests, the value of n was set to 150, and the input matrices were generated at run-time. MMT consists of 131 lines of pH code.

Gamteb: a Monte Carlo simulation of the trajectories of photons passing through a carbon rod divided into two cells. Gamteb contains an enormous amount of parallelism since each particle can be simulated independently. For our purposes, we simulated 40,000 particles. Gamteb consists of 691 lines of pH code.
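The blocking strategy described for MMT can be sketched in C as follows. This is our illustration of the general technique, not the pH source; it assumes row-major matrices and that n is a multiple of 4:

    /* 4x4 register blocking: each pass of the inner loop loads 4 elements
     * of A and 4 of B and updates 16 partial sums, reducing memory
     * accesses per multiply-add. */
    typedef long long elem;              /* MMT uses 64-bit integer elements */

    void mmt_block(int n, const elem *A, const elem *B, elem *C) {
        for (int i = 0; i < n; i += 4)
            for (int j = 0; j < n; j += 4) {
                elem c[4][4] = {{0}};    /* 16 partial results in registers */
                for (int k = 0; k < n; k++) {
                    elem a0 = A[(i+0)*n+k], a1 = A[(i+1)*n+k],
                         a2 = A[(i+2)*n+k], a3 = A[(i+3)*n+k];
                    elem b0 = B[k*n+j+0], b1 = B[k*n+j+1],
                         b2 = B[k*n+j+2], b3 = B[k*n+j+3];
                    c[0][0] += a0*b0; c[0][1] += a0*b1; c[0][2] += a0*b2; c[0][3] += a0*b3;
                    c[1][0] += a1*b0; c[1][1] += a1*b1; c[1][2] += a1*b2; c[1][3] += a1*b3;
                    c[2][0] += a2*b0; c[2][1] += a2*b1; c[2][2] += a2*b2; c[2][3] += a2*b3;
                    c[3][0] += a3*b0; c[3][1] += a3*b1; c[3][2] += a3*b2; c[3][3] += a3*b3;
                }
                for (int r = 0; r < 4; r++)
                    for (int s = 0; s < 4; s++)
                        C[(i+r)*n+j+s] = c[r][s];
            }
    }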

8.1.2 Machine

The benchmark programs were executed on a Sun Enterprise 5000 multiprocessor with eight UltraSPARC 168MHz processors and 512MB of RAM. The operating system was Solaris 2.6. Though the machine was not running in "single-user" mode, the load apart from the pH benchmarks was negligible.

8.1.3 Instrumentation

Both the code produced by the compiler and the RTS were instrumented to gather information about the benchmark programs. The compiler emits code that keeps counts of the SMT instructions executed. In addition, the code tracks whether or not suspensive instructions like touch actually suspend. For good performance, the statistics are gathered on a per virtual-processor basis and are aggregated at the end of execution. If global statistics counters were used instead, the code would need to update them with slow atomic operations.
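A minimal sketch of the per-virtual-processor counter scheme; the counter layout, padding, and names are illustrative:

    #define NSVP_MAX  8
    #define NCOUNTERS 16     /* e.g., one per SMT instruction category */

    struct stats {
        unsigned long c[NCOUNTERS];
        char pad[64];        /* keep each SVP's counters on its own cache line */
    } stats[NSVP_MAX];

    /* fast path: a plain increment, no atomic operation, no bus traffic */
    #define COUNT(svp, which) (stats[svp].c[which]++)

    /* totals are computed once, at the end of execution */
    unsigned long total(int which, int nsvp) {
        unsigned long sum = 0;
        for (int i = 0; i < nsvp; i++) sum += stats[i].c[which];
        return sum;
    }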

The RTS is instrumented to keep track of the frequency and duration of the calls to major routines such as garbage collection and frame allocation/setup. The timing routines take advantage of the high-resolution timer interface, gethrtime, under Solaris. Program execution times were gathered using the timex system command, and we recorded the "real" time in our measurements.

It is always important to keep in mind the "probe effect" when gathering performance data of parallel programs: instrumentation code affects the behavior of programs, sometimes quite substantially. To limit this effect, we tried to insert as little instrumentation code as possible, especially timing code. In the following sections, we discuss how these concerns affected our measurement techniques.

8.2 Components of Execution Time

In our first investigation, we wanted to get an idea of where programs were spending time during execution. We divided the run time of a program into Compute time, where the program is doing useful work to produce a result, and RTS time. Within the RTS time, we further sub-divided the time into one of three categories: garbage collection, scheduling, or frame initialization.

Figure 8-1 consists of a graph with five bars, one for each benchmark, showing the percentage of execution time dedicated to each of these four activities. The data correspond to single-processor executions of each of the programs. The total run time is gathered from uninstrumented executions of the program, while the RTS component times are gathered from instrumented runs. We combine the timings of instrumented and uninstrumented runs because we did not want to add the overhead of timing the RTS sections to the total run time. The Compute time is the difference of the total run time and the sum of the time spent in the RTS.

[Figure 8-1: Major Components of Run Time. For each benchmark, a bar shows the percentage of execution time spent on GC, the scheduler, frame initialization, and computation.]

We knew that our frame initialization strategy was not particularly efficient, but the results shown in the graph were still surprising: frame initialization is a significant overhead

in all benchmarks. The behavior of Fib is not of particular concern, since Fib performs a tremendous number of function calls. But the results for the other benchmarks point out that we need to re-examine our strategy for loops. In the current code generator, loops are transformed by the front-end of the compiler into tail-recursive function calls. Tail-recursion in our compiler simply means that a callee does not necessarily have to return to its parent. Instead, the caller information passed to the callee can be that of some ancestor of its parent. This scheme allows the frames of tail-recursive procedures to be garbage-collected efficiently.

It also means that we do not reuse frame memory between garbage-collections, so each loop iteration must allocate and initialize its own activation frame. Without independent activation frames, loop iterations cannot execute in parallel. Tail-recursion, however, may be more general than necessary for loops. In particular, loop "functions" are never curried, so they can always expect to be called as a result of a full application. Also, loop parameters are often constants, so loops can expect a portion of their arguments to be evaluated. These two facts may allow us to design a faster, loop-only frame initialization strategy. Eventually, we may also want to reuse already allocated loop frames directly, bypassing the garbage collector, but this must be done in a way that does not affect non-strictness or the semantics of barriers.
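To illustrate the cost being described, here is a hedged C sketch of a loop lowered to a tail-recursive call under the current scheme; alloc_frame and the frame layout are hypothetical:

    /* Each "iteration" allocates and initializes a fresh activation
     * frame, reclaimed only by the garbage collector. The independent
     * frames are what allow iterations to execute in parallel. */
    struct loop_frame {
        void *ancestor;   /* return point: an ancestor, not the direct parent */
        long  i, acc;     /* loop variables */
    };
    extern struct loop_frame *alloc_frame(void);   /* hypothetical RTS call */

    long loop_iter(struct loop_frame *f) {
        if (f->i == 0) return f->acc;          /* loop exit */
        struct loop_frame *g = alloc_frame();  /* per-iteration cost we pay */
        g->ancestor = f->ancestor;             /* tail call: skip this frame */
        g->i   = f->i - 1;
        g->acc = f->acc + f->i;
        return loop_iter(g);                   /* "iterate" by tail-recursing */
    }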

There is some good news in Figure 8-1. Excluding Paraffins, garbage collection accounts for less than 5% of the total run-time in the other four applications. This seems a very reasonable overhead for fully automatic heap management. Of course, as we decrease the overhead of frame initialization, the relative cost of garbage collection might increase. The behavior of Paraffins, on the other hand, is disturbing. The problem with Paraffins seems to be that its peak working set is huge, often 50MB and sometimes close to 90MB. Traversing and collecting a working set of this size is expensive, and it has a first-order effect on performance.

The Paraffins working set consists largely of activation frames, not user data structures. The problem is that several "loop" nests generated from list comprehensions (not from loop constructs) are structured so that iterations are not called in a tail-recursive manner. This means that at the end of each of these loop nests, the number of live activation frames is equal to the sum of the iteration counts at each loop level. With an input of 20 to Paraffins, this could result in more than 100,000 live activation frames at some points in the execution. It is interesting to note that the version of the code compiled by the Id compiler for Monsoon does not suffer from this condition. Id list comprehensions are implemented using "open-lists" and explicit side-effects. This allows their construction to be expressed as efficient tail-recursive code.

Though disappointing on the whole, this experiment did yield important results. It shows that before considering other optimizations, like parallelizing the garbage collector or improving the work-stealing algorithm, we must reduce the cost of frame initialization.

In addition, the problems with storage management in Paraffins pointed out a deficiency in the compilation of list comprehensions. The solutions to these problems will certainly involve changes to both the compiler and the run-time system.

8.3 Instruction Mix

After determining the major components of execution time, we wanted to better understand the Compute portion of a program by examining the dynamic SMT instruction mix.

[Figure 8-2: SMT Instruction Mix. For each benchmark, the percentage of dynamic instructions in each category.]

In Figure 8-2, we show the percentage of instructions executed in eight different categories:

ALU (arithmetic and register-to-register moves), Touch, Spawn/Schedule, Control Flow (jump and switch), Memory Access (synchronized and unsynchronized), Memory Allocate, Barrier (initbar, incbar, etc.), and Frame Initialization (RTS primitive). The data shown pertain to single-processor executions. The instruction mix for any particular application is essentially identical across any number of processors.

In most of the benchmarks, the portion of the instruction mix due to memory accesses seems high. A principal cause appears to be poor threading, and it affects the code in two ways. First, if a value is used in several different threads, it must be reloaded from the frame in each thread. This situation is commonplace in the code to compile pattern matching of data structures. The data structure is broken down into components in one thread, and the components are used in other threads. In the interim, the components must be stored in the frame, since registers cannot be used to transmit values from thread to thread. Second, when a value is used in several threads, it is usually assigned a reference and a location. To access the value, the code must traverse the extra indirection of the reference.
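A small sketch of the resulting memory traffic; the frame layout and location type are illustrative:

    struct location { int present; long value; };   /* illustrative cell */

    long consumer_thread(long *frame) {
        /* the value cannot stay in a register across threads, so the
         * reference is reloaded from its frame slot ... */
        struct location *loc = (struct location *)frame[3];
        /* ... and the actual value is reached through one extra
         * indirection */
        return loc->value + 1;
    }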

In contrast, the instruction mix data shows that touch instructions account for less than 7% of the instructions executed in the benchmarks. This number does not represent an important overhead, and it also decreases as partitioning improves. In fact, as we discuss in the next section, the cost of touch instructions in most programs is very small since the majority of these do not cause suspensions.

It is interesting to compare the memory management data for Paraffins from the previous section and this section. In the previous section, we saw that garbage collection accounted for roughly a third of the execution time, yet in the instruction mix, Paraffins does not seem to execute an outrageous number of allocation instructions. A possible explanation for this behavior is that Paraffins is allocating large, long-lived data structures. This led us to examine the behavior of Paraffins more closely and to discover that the culprits are in fact activation frames.

Of course, not all SMT instructions are created equal. The machine-code implementations of some instructions are significantly more complex than others. Consequently, it would also be useful to determine the machine-code instruction mix for a program. We decided to leave this task for the future since it would seriously complicate our otherwise portable implementation. In any case, efficient SMT code is a prerequisite if machine-code optimizations are to be of any use.

8.4 Touch Behavior

An important assumption in our compilation strategy is that touch instructions are cheap because most programs suspend rarely. We wanted to determine whether this assumption was actually correct, so we inserted instrumentation code to record the outcome of each touch instruction. Figure 8-3 shows the percentage of touch instructions that actually encountered an empty data slot and caused the thread to suspend.

In three of the five benchmarks, the suspension rates are essentially negligible. This result agrees well with our notion that most pH programs can execute in a fairly strict fashion. In a program without data structures like Fib, we can even execute without any suspensions. Some other interesting patterns emerge from this data. As the number of processors increases, the number of touch instructions that suspend also tends to increase.

There is a simple explanation for this phenomenon. In most programs, threads that consume values run after the threads that produce values, but as more processors are made available,

the chance that consumers execute concurrently with, or even ahead of, producers increases, and concomitantly, so does the chance of suspension.

Figure 8-3: Percentage of touch Instructions that Suspend

    Processors      1     2     3     4     5     6     7     8
    Fib           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
    Queens       12.3  12.0  12.0  13.2  14.5  15.1  15.6  16.2
    Paraffins     0.2   0.2   0.2   0.2   0.5   0.8   1.0   0.9
    MMT           0.3   0.7   1.4   0.7   1.0   1.0   1.1   1.2
    Gamteb        6.2  10.9  13.2  14.1  14.4  14.7  15.0  15.3

It is useful to compare these results with those of Schauser and Goldstein [57]. In their study, they were able to run Fib, Paraffins, and MMT without any suspensions. How? They inserted barriers into the programs to guarantee a static schedule. In contrast, the code generated by pHc still exhibits some suspensions because the scheduling guarantees of partitioning are much weaker than those of barriers. The two programs that exhibit significant rates of suspension, Queens and Gamteb, are two of the few that Schauser and Goldstein found to require dynamic scheduling, i.e., they are fundamentally non-strict. Thus, it is not surprising that our simple "most serial" scheduling strategy leads to lots of suspensions. The proper schedule for these programs is data dependent and complex.

The analysis of the behavior of touch instructions has affirmed a basic assumption of our compiling strategy: that programs rarely need to suspend. We saw earlier that touch instructions are not a significant portion of the instruction mix, and now we know that most of these can execute very cheaply. We conclude, then, that optimizing touch should not be a priority in the current compiler.

8.5 Speedup

In this section we present the results of running our benchmark programs on different numbers of processors. Remember that for any particular program, the executable and the input parameters are identical across all processor counts. To use different numbers of processors, we simply change a command-line switch that tells the pH RTS how many virtual processors (SVPs) to create. Threads are scheduled onto the SVPs by the work-stealing scheduler, but we leave it up to the operating system to schedule SVPs onto real CPUs.

Figures 8-4, 8-5, and 8-6 show the speedup curves generated from our measurements.

The speedup factor for running on n processors is calculated using the formula:

    Speedup(n) = (Time on 1 Processor) / (Time on n Processors)

The elapsed times used to compute the speedup factors correspond to the execution of uninstrumented programs. We did not want to distort the figures by including the overhead of statistics collection.

The speedup results are mixed. On the positive side, Fib, MMT, and Gamteb show between 2.7 and 3 speedup on four processors. Queens comes close with a 2.4 speedup on four processors. In addition, most codes continue to improve at least up to six processors. But beyond four processors, the rate of improvement begins to degrade significantly, and some programs actually slow down. MMT, for instance, shows only 2.7 speedup on eight processors.

The best speedup achieved was 5, on Fib running on eight processors. While promising, there should be enough work in Fib to achieve perfect linear speedup. We expect that better memory usage in function calls may improve the performance of Fib, since it currently performs about 600 garbage-collections during its execution.

It is most disappointing, however, that speedups are fairly meager. The algorithms in these benchmarks are known to contain lots of parallelism, but our current compiler seems to be unable to exploit it beyond four processors. Paraffins is particularly galling since it does not speed up at all. We believe this problem is directly related to the issues outlined earlier with list comprehensions and a huge working set.

8.6 Single-Processor Performance vs. Haskell

The lenient evaluation strategy of pH should yield more efficient code than the full-blown lazy evaluation of Haskell. As discussed in Section 2.5, the correct implementation of lazy evaluation places a great bookkeeping burden on the implementation. On the other hand, Haskell compilers do not have to worry about parallelism, while the pH compiler always generates programs that execute in parallel by default, without any uniprocessor optimizations. A uniprocessor-specific implementation of pH, for instance, could do without

the expensive atomic operations required for synchronization.

[Figure 8-4: Speedup Curves for Fib and Queens.]

[Figure 8-5: Speedup Curves for Paraffins and MMT.]

[Figure 8-6: Speedup Curve for Gamteb.]

To determine the effect of these two opposing forces on the performance of pH code, we decided to compile our standard pH benchmarks using two common Haskell compilers, the Glasgow Haskell Compiler (GHC 4.00) and the Chalmers Haskell Compiler (HBC 0.9999.4). The MMT benchmark could not be compiled in Haskell because it uses I-Structure side-effects directly, but the other four benchmarks required only minor changes (due to different type class structures for I/O) to compile under GHC and HBC. In Figure 8-7, we show the execution time in seconds for five versions of each program. The input parameters are as given in Section 8.1. The first column shows the pH single-processor execution time; the second and third columns show the optimized and unoptimized versions under GHC; and the last two columns show the optimized and unoptimized times for HBC.

Both Haskell compilers, even generating unoptimized code, trounce the single-processor performance of pHc. This is particularly apparent in Fib, where the two Haskell compilers properly recognize that the code is strict and optimize procedure calls dramatically. Nevertheless, it is important to understand that beyond generating parallel-ready code, pHc is also operating under another handicap: it is generating 64-bit code instead of 32-bit code like the Haskell compilers. This can have a tremendous impact on performance, especially since our native code generator (gcc) tends to emit much less efficient code for 64-bit quantities. In one extreme, we observed that a simple integer matrix multiply program, written in C and compiled with gcc, ran six times slower when using 64-bit (long long) matrix elements rather than 32-bit (long) elements.

Figure 8-7: Single-Processor Execution Times. The table shows the time in seconds for five versions of each benchmark: the first is the original pHc program, while the other four are compiled with Haskell compilers.

                  pHc         GHC 4.00                  HBC 0.9999.4
                  optimized   optimized  unoptimized    optimized  unoptimized
    Fib               64.03        1.80       14.58          3.05        3.76
    Queens            29.89        0.95        3.78          2.56        2.86
    Paraffins         30.29        2.10        2.27          7.50       12.84
    Gamteb           291.91       83.44      190.71         41.83       43.16

8.7 Summary

This preliminary evaluation of the pH code-generator has shown that the code is slow and often resource intensive. In looking at the major components of execution time, we discovered that programs spend an inordinate amount of time in frame initialization. Some of the overhead comes from inefficiencies in our calling conventions, but the bulk seems to be due to the transformation of loops into regular (though tail-recursive) procedure calls. Loops are less general than procedure calls, and we should develop a special, simpler, and faster frame convention for them. We also discovered that, except for Paraffins, garbage collection was not a major component of execution time. The huge working set of Paraffins seems to be the result of poor compilation of certain list comprehensions. This area of the compiler will certainly receive our attention in the future.

We also looked closely at the dynamic SMT instruction mix. We found that the most common instructions executed were Memory Access instructions. This is likely due to deficiencies in our partitioning and threading algorithms, which often force the use of frame slots instead of registers for storing values. We also found that touch instructions account for a modest percentage of all instructions executed. Looking more closely at the behavior of touch instructions, we found that in three of the five benchmarks (Fib, Paraffins, MMT) at least 98.8% of these executed quickly without suspending. The suspension rates in Queens and Gamteb were much higher (up to 16.2%), but these programs are known to be highly non-strict.

We measured the performance of each benchmark with the same input parameter in machine configurations ranging from one to eight processors. Overall, the speedup numbers we obtained were disappointing. The best speedup was a factor of 5 for Fib on eight processors. The worst case was Paraffins, showing a speedup of 1.3 on eight processors. We believe the cost of garbage collecting a huge working set is mostly to blame. Most benchmarks showed marginal speedups up to six processors, but beyond that, we actually observed slowdowns in all except Fib. In fact, beyond four processors, speedup rates declined noticeably. Finally, we compared the single-processor performance of pH code with optimized and unoptimized code from two Haskell compilers, GHC and HBC. We found that our code is much slower than that produced by the other two compilers. In theory, the lenient evaluation of pH should help us produce faster code, but in reality, pHc is hampered by the fact that it produces 64-bit, parallel programs rather than 32-bit, sequential ones.

The instrumentation framework we have implemented in the compiler and run-time system has given us good insight into the behavior of our code. Still, there are other measurements we may want to consider in the future. For instance, we did not study how the pH code consumes memory bandwidth. The Sun Enterprise server architecture provides excellent performance in this area, but it is possible that our current codes, which tend to access memory very frequently, might be saturating the system. This problem is not easy to detect with our current instrumentation. We expect to look into it eventually, but only after the efficiency of the compiled code has been improved somewhat.

Chapter 9

Related Work

Excellent research on parallel functional programming languages goes back at least 20 years. Hammond [23] provides a thorough overview of the field, so in this section, we concentrate on the work most closely related to the pHc compiler.

9.1 Id/pH Compilers

The pHc compiler owes much to three previous research efforts that have wrestled with the problem of generating code from Id/pH for non-dataflow parallel machines: Berkeley Id, pHluid, and Id-S.

9.1.1 Berkeley Id

Shortly following Traub's initial work on partitioning, Culler and his group at Berkeley began work on an Id compiler that utilized the front- and middle-ends from the MIT dataflow compiler but included a new code generator for parallel machines based on standard processors. The Berkeley team made a number of important contributions, the first of which was the development of the Threaded Abstract Machine (TAM) [16]. The TAM framework consisted of self-scheduling, non-suspensive threads; a hierarchical storage model consisting of registers, local frames, and heap-allocated data structures; explicit message-based communication between tasks; and a two-level scheduling model. The TAM framework was designed to compile Id on massively-parallel machines like the CM-5, and it owes much to its dataflow heritage. However, TAM can stand on its own as a general multithreaded compiling target for any language. For Id, it allowed development on the language to be decoupled from the underlying execution vehicle.

The second important contribution made in the Berkeley Id compiler was the development of a series of powerful partitioning algorithms [56, 70, 55]. The best of these, in Schauser's dissertation, handles all the complex features of the language and generates code that is almost as fast as if programs were compiled strictly. In fact, Schauser notes that progress in other areas, such as I-Structure analysis, should be made to support more advances in partitioning.

The third contribution of the Berkeley work was a family of Id compilers for both workstations and parallel processors like the CM-5. Impressive speedups were reported over a wide range of applications, though absolute performance tended to lag behind more conventional languages like C or Fortran. These compilers eventually set the benchmark for Id execution since their performance on ever-faster standard processors surpassed even that of the custom-made, though aged, Monsoon.

There are four basic differences between the approach taken in Berkeley Id and pHc. First, pHc generates suspensive threads while TAM generates non-suspensive threads. One might think there is a similarity between a suspensive thread and a TAM quantum (an epoch in which all threads scheduled belong to the same procedure), but actually, suspensive threads are very different. They are statically defined, and they can include control-flow that spans procedures. Second, TAM generates code from structured dataflow graphs while pHc uses λ*. Third, the scheduling mechanism in the SMT machine does not necessarily follow the TAM two-level scheme. We purposely left the scheduling mechanism under-specified to allow more flexibility in the implementation. Finally, pHc was designed strictly for SMPs, and as such, it assumes a much lower-latency memory interface than TAM. This assumption tends to influence the system design substantially, especially in the implementation of synchronization.

9.1.2 pHluid

The pHluid system [18] is an Id compiler and run-time system developed at DEC CRL and is an outgrowth of Nikhil's work on P-RISC [43]. The compiler emits parallel code suitable for loosely-coupled distributed memory machines, based on the assumption that the most widespread and economical parallel machines in the near future will be clusters of workstations. Internally, pHluid translates Id source programs into P-RISC graphs. The nodes of these graphs are not single operations, as in dataflow, but rather short sequences of instructions for a RISC-like multithreaded abstract machine. Optimizations and thread partitioning are performed on the P-RISC intermediate code, and the compiler generates C code that, like pHc, leverages the specialized features of the GNU compiler.

The pHluid system focuses on two crucial issues essential to good performance in workstation clusters: good load balancing/management and effective use of the memory hierarchy. The run-time system implements a work-stealing scheduler with clever extensions for prioritized threads. The heap allocator incorporates many tricks to keep the cost of I-structure accesses as low as possible, includes a cheap but effective heap caching scheme that takes advantage of the monotonic behavior of I-structure presence states, and implements a full distributed garbage collection algorithm. Early performance measurements indicate that pHluid generates decent uniprocessor code as well as good speedups on small parallel benchmarks like nqueens.

Berkeley Id and pHluid exhibit many similarities, such as the use of non-suspensive threads and a graphical intermediate representation, but pHluid is targeted at parallel machines where communication latencies are enormous compared to processor cycle times. Thus, it places great emphasis on the heap caching scheme, which is absent from TAM. Also, scheduling in pHluid, as in pHc, does not follow the two-level structure of TAM.

9.1.3 Id-S

Andy Shaw [59] analyzed the performance of Id and concluded that the overheads incurred precluded an efficient implementation of the full multithreaded language on small SMPs. His solution to the performance problem was radical: he defined Id-S, a new dialect of the language. Id-S is a sequential language, so it is supposed to execute in "top-to-bottom, left-to-right" order. In addition, it is strict, first-order, and assumes, but does not enforce, I-Structure semantics for heap data structures.

It is easy to generate good sequential code from Id-S. In fact, the language is even easier to compile than ML. It is also fairly straightforward to generate parallel code for some applications. For this, Shaw relies heavily on the single-assignment semantics of data structures and variables, since these allow him to parallelize loops while ignoring output- and anti-dependences. In addition, the lack of pointers in the language simplifies alias analysis. The compiler also assumes it has access to the whole program so that it can see all dependences across functions.

The performance results for Id-S on certain programs are impressive. Matrix multiply, for instance, achieves a speedup of 3.6 on 4 processors with respect to an efficient sequential version. For less structured programs, like Gamteb, speedup on four processors over the sequential version is negligible. Still, Shaw proves his point. Programming in Id-S is much easier than programming in an explicitly parallel language, and for the right kind of applications, Id-S can generate code that utilizes even a small SMP effectively.

The goal of the pHc compiler is to compile the entire pH language with no compromises. In this respect, its task is more complex than that of the Id-S compiler. Still, Id-S is useful because it shows how efficient pHc code might be if partitioning and scheduling were omniscient.

9.2 Concurrent and Parallel Versions of Haskell

In addition to pH, there exist two major parallel Haskell efforts: Concurrent Haskell and Glasgow Parallel Haskell (GPH).

9.2.1 Concurrent Haskell

Concurrent Haskell [31] arose as a solution to the difficulties of supporting modern graphical programming in a lazy functional language. The authors noted that with a few carefully chosen concurrency extensions, such applications can be written easily as collections of communicating processes.

Concurrent Haskell adds two new elements to the original language: a forkIO primitive for explicit creation of new processes within a Haskell application and an MVar primitive type of mutable locations derived from Id/pH M-Structures. MVar objects are used to implement the synchronization abstractions needed by concurrent programs. These extensions to the language require major updates to the run-time system, but only minor changes to existing compilers.

Concurrent Haskell and pH differ in several important areas. First, parallelism in pH is implicit while in Concurrent Haskell, it is explicit in forkIO. Second, pH is designed to generate parallel multithreaded code. Concurrent Haskell is not designed to exploit multiprocessors, though it could be adapted to do so.

9.2.2 Glasgow Parallel Haskell (GPH)

While Concurrent Haskell uses concurrency to simplify programming, GPH has been designed to exploit the performance of multiprocessors. GPH is an extension of non-strict, purely functional Haskell that includes sequential and parallel combinators. The seq combinator is used to serialize two computations, while the par combinator is used to indicate that an expression can be evaluated in parallel even if its value has not been demanded. The principal execution vehicles for GPH are GUM [72] and GRANSIM [37].

GUM is a parallel run-time system built on top of the portable PVM communications infrastructure. Thus, it assumes a distributed memory machine architecture. GUM implements garbage-collected local and global heaps and a work-stealing thread scheduler. In GUM, GPH expressions identified with the par combinator are sparked at run-time. A pointer to a sparked expression's thunk is placed in the local processor's spark pool. A spark is simply an indication that the expression can be evaluated concurrently, not that it must be evaluated. If the spark pool overflows, the processor can discard sparks. The values will be computed eventually if they are demanded by some part of the program.

Synchronization between GUM threads is handled in a manner similar to I-structures. When a thread starts to evaluate a thunk, it replaces the contents of the thunk with a black hole. If a subsequent thread attempts to evaluate the black hole, its context is saved, and the thread is placed on a suspension queue. Eventually, the value of the thunk is computed and written back, and any suspended threads are rescheduled.
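A hedged sketch of this black-holing protocol in C; the types and queue operations are our own illustration, not GUM's code:

    enum state { THUNK, BLACKHOLE, VALUE };

    struct closure {
        enum state     tag;
        long           value;      /* valid once tag == VALUE        */
        struct waiter *waiters;    /* threads suspended on this cell */
    };

    extern long eval(struct closure *c);                  /* hypothetical */
    extern void suspend_current_thread(struct closure *c);
    extern void wake_all(struct closure *c);

    long force(struct closure *c) {
        switch (c->tag) {
        case VALUE:                          /* already computed */
            return c->value;
        case BLACKHOLE:                      /* someone else is computing it */
            suspend_current_thread(c);       /* join the suspension queue */
            return c->value;                 /* resumed after the write-back */
        default:                             /* THUNK: claim and evaluate */
            c->tag = BLACKHOLE;
            c->value = eval(c);
            c->tag = VALUE;
            wake_all(c);                     /* reschedule suspended threads */
            return c->value;
        }
    }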

GRANSIM was designed as a highly-parameterized parallel machine simulator for GPH code. The structure of GRANSIM is similar to that of GUM, with spark and thread pools, but it includes much more flexibility. For instance, the user can control thread migration, synchronous versus asynchronous communication, the granularity of work distribution, as well as the number of processors (possibly infinite). GRANSIM also supports extensive facilities. It has been used to parallelize large functional programs including Lolita, a 47,000 line natural-language system.

There are several differences between pH and GPH. First, pH is lenient and includes side-effects, while GPH is lazy and pure. More importantly, GPH is explicitly parallel while pH is implicitly parallel. Finally, pH is designed to exploit fine-grain, unstructured parallelism. GPH is better suited to coarse-grain programs, since it is up to the programmer to expose parallelism.

9.3 Cilk Scheduling

The work in Cilk [12] inspired the design of the scheduler in the pH Run-Time System. Cilk has shown that randomized work-stealing is an efficient way to schedule multithreaded programs. Indeed, this type of scheduling is within a constant factor of optimal for the kinds of computations that can be written in Cilk. These are programs whose execution graphs are restricted so that all children created by a thread synchronize later in that same thread. The optimality result does not apply in general to pH because pH programs exhibit much more complex computation graphs.

The key property of a Cilk-like scheduler is that it performs a mostly depth-first traversal of the call-tree as the program executes. This means that most procedures that are spawned are actually executed on the same processor that spawned them, using the standard, efficient, sequential calling conventions. Only when a processor runs out of work and grabs work from another does the program pay the overhead of parallelism. As Schauser and Goldstein [57] demonstrated, non-strict pH programs in practice tend to exhibit the same kind of execution graphs as the strict programs generated by Cilk.
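A minimal sketch of this scheduling discipline, with all synchronization omitted for clarity; a real implementation needs atomic operations on the deques, and the names are illustrative:

    #include <stdlib.h>

    struct thread;                          /* opaque task descriptor */
    struct deque { struct thread *slot[1024]; int top, bot; };
    static struct deque q[8];               /* one deque per SVP */

    /* spawn pushes at the bottom of the local deque */
    static void spawn(int self, struct thread *t) {
        q[self].slot[q[self].bot++] = t;
    }

    /* pop locally in LIFO order (depth-first); steal FIFO from a victim */
    static struct thread *next_work(int self, int nproc) {
        struct deque *d = &q[self];
        if (d->bot > d->top) return d->slot[--d->bot];
        int v = rand() % nproc;             /* random victim */
        if (v != self && q[v].bot > q[v].top)
            return q[v].slot[q[v].top++];   /* take the victim's oldest task */
        return NULL;                        /* nothing found; retry later */
    }

Because local pops are LIFO, execution follows the sequential call order until a steal occurs, which is exactly the depth-first property described above.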

Chapter 10

Conclusion

This dissertation has described the design and implementation of a new compiler for pH. The principal goal of pH is to make parallel programming easy by making the compiler do most of the work of finding and exploiting concurrency. pH includes high-level language facilities like polymorphism and higher-order functions and combines these with non-strictness, implicit parallelism, synchronizing memory, and fine-grain barriers. This unique set of features makes it very challenging to generate efficient code for pH programs.

10.1 Contributions

The work described in this dissertation has produced the following contributions:

" A compiler for pH with complete support for all language features.

" A new code-generation algorithm that works directly on A*, an intermediate representa-

tion based on the A-calculus. The multithreaded code generated by the compiler makes

use of suspensive threads, an approach that has been largely ignored by previous efforts.

" In joint work with Jan-Willem Maessen, a complete Run-Time System for pH including

a work-stealing scheduler and garbage collector.

" A novel tagged data representation that supports the implementation of synchronizing

memory and garbage-collection with minimal impact on the range of scalars.

Our approach throughout this project has been to emphasize functionality over performance. In hindsight, we believe this strategy has served us well for two reasons. First, in considering all language features at once, we were forced to build all the important internal abstractions, and we are certain there are no holes in our underlying framework. We finally know what it takes to build a complete pH implementation. Second, we have found that optimizing an existing feature, given that all the proper abstractions exist, is easier than adding a new feature to an existing implementation. We have a complete compiler and a solid reference against which we can design optimizations.

10.2 Future Work

In the preliminary evaluation of the code-generator for pH, we discovered that our compiler generates slow code that tends to consume a lot of memory. The data we gathered has pointed out several areas that need to be improved. We discuss the most important of these here.

10.2.1 Loops

The most critical change that needs to be made to the compiler is to improve the compilation of loops. The current strategy of implementing loops via full tail-recursive procedure calls leads to excessive overhead. Recall that tail-recursion in our compiler simply makes garbage-collection of loop frames more efficient. It does not decrease allocation or initialization costs. Loop bodies are less general than full procedures, so there seems to be a good opportunity to streamline the loop invocation mechanism. The challenge, however, is to do this while preserving correctness. This work involves both high-level changes to loops and low-level implementation improvements.

At the low level, we need to improve how loop constants are stored and how loop activation frames are invoked. At the high level, we need to devise new strategies for executing loops. The Id compiler for Monsoon implemented three kinds of loops (unbounded, bounded, and sequential), each with increasing efficiency. However, only unbounded loops truly implemented the proper non-strict semantics of the language in all cases. Unbounded loops are essentially identical to the tail-recursive procedure calls in pH. In Id, it was up to the programmer to specialize a loop using one of the more efficient schemes and to guarantee its correctness. It seems compelling, however, to perform these transformations automatically. It is possible that with relatively direct analysis the compiler could specialize a loop by considering the loop indices, the computations in the body, and the cost of the loop. Following our mostly-serial execution strategy, it is conceivable that many small loops could be turned into sequential code, thus avoiding all the overhead associated with executing parallel loop bodies.

In a related area, the compilation of list comprehensions needs some work to make sure that it generates loops as often as possible, probably combined with the use of open lists. There is substantial code in the compiler already to remove intermediate lists via deforestation, but we have discovered that list comprehensions are often turned into non-tail-recursive function calls. In our Paraffins benchmark, this has the effect of increasing the size of the working set to the point where garbage collection becomes a major overhead. With proper tuning and with improvements in the compilation of loops, it will be possible to dramatically improve the performance of list comprehensions.

10.2.2 Better Partitioning and Threading

The existing partitioning analysis in the pH compiler only considers intra-procedural dependences, so it needs to be extended to handle inter-procedural dependences. Such analysis will determine the ways in which input arguments are passed to procedures and will open up opportunities for increasing thread lengths and for getting rid of locations for arguments. Despite Schauser's [55] observation that the benefits of inter-procedural partitioning were modest, we believe this is a necessary addition to the compiler. It is conceivable that, since our baseline code is somewhat less strict than that produced by the Berkeley compiler, the results of inter-procedural partitioning will be more significant in pHc.

A related area that is also ripe for further work is the analysis of program data structures to improve threading. Our intermediate λ* code includes lots of destructuring of aggregate data structures like tuples and algebraic types. Often, the code to extract a value from a data structure ends up in a separate thread from the code that uses the value. Currently, this situation forces the compiler to insert synchronization code that, in fact, may not be needed. By tracking the creation, use, and lifetimes of data structures, the compiler might be able to determine that synchronization is not necessary because a particular data structure is produced completely before it is consumed. Thus producer and consumer code can be combined into a single large thread.

10.2.3 Procedure Cloning

One of the difficulties of partitioning is that a partition must be correct with respect to all contexts in which a procedure might be invoked. In many cases, the compiler must be conservative and has to produce a less efficient partition when there exists the possibility of a non-strict use of a procedure. With inter-procedural information, this situation will be less common, though across program modules or in higher-order functions, it still exists.

An interesting solution to this problem is to create not one but several versions of a procedure with different strictness properties. Each version of the procedure is compiled for a particular kind of context, and thus, it can make strong assumptions about the behavior of incoming arguments. To prevent code explosion, we could simply compile two versions of every procedure: a general one applicable to all contexts and a more efficient, strict one that assumes all arguments have been computed when the procedure is invoked. In turn, the compiler can choose the proper version to apply at each call site, independent of all other call sites!
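A sketch of the two-clone idea in C; the names and the touch interface are illustrative:

    struct location { int present; long value; };    /* illustrative cell */
    extern void touch(struct location *loc);         /* may suspend the thread */

    /* strict clone: all arguments known present; no synchronization at all */
    static long f_strict(long x, long y) { return x + y; }

    /* general clone: tolerates non-strict callers */
    static long f_general(struct location *x, struct location *y) {
        touch(x);                         /* wait until x has been written */
        touch(y);
        return x->value + y->value;
    }

    /* a call site whose arguments are statically known to be computed
     * binds directly to the cheap clone: */
    static long caller(long a, long b) { return f_strict(a, b); }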

This strategy may prove particularly useful in improving loops and recursive functions. Often, the first iteration of a loop (or the outermost recursive function call) must be compiled with weak partitioning assumptions since it might occur in a variety of contexts. Subsequent iterations (recursive calls), however, are often completely strict. With cloning, these iterations can be made much more efficient at the small cost of doubling the static code size.

10.2.4 Inter-thread State

When neither better partitioning nor cloning can yield longer threads, the compiler must arrange to pass state from one thread to the next. The current strategy for accomplishing this task involves the use of the activation frame and lots of memory operations: every thread that needs a value created elsewhere must read it out of the frame. It seems, therefore, that there is ample need for an inter-thread register allocation algorithm, perhaps one that can be combined with procedure cloning.

This problem is markedly different from the standard inter-basic-block (global) register allocation problem. After all, the standard algorithms rely on the fact that the order in which basic blocks execute, in well structured code, can be approximated quite accurately by static analysis. The order in which threads in a multithreaded program execute is, by definition, dynamic and not computable at compile-time. Still, all is not lost. The compiler can make a good guess at what might be the "most likely" schedule for thread execution, given that it knows that programs are rarely non-strict and that the run-time system will try to follow a mostly-sequential scheduling strategy. When viewed from this perspective, the problem resembles trace-scheduling.

The dynamic component of such an algorithm would keep track of the scheduling decisions made by the run-time system. On entry to a new thread, a simple test can determine whether or not the previous thread was the one that should have preceded it in the "most likely" compile-time schedule. If so, the new thread can execute quickly because it can reuse many values already in the register file. If not, the new thread must reload these values from the frame, possibly by reverting to code in a "slow" clone of the procedure. At run-time, if the code rarely diverges from the "most likely" compile-time schedule, then most inter-thread state will be passed in registers. Though not as efficient as pure static register allocation, the scheme promises to be far superior to saving and restoring values from memory.
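A sketch of the dynamic check in C; the thread numbering, the last_thread variable, and the clone names are illustrative:

    extern int last_thread;          /* maintained by the RTS (illustrative) */

    /* slow clone: restores inter-thread state from the frame */
    static long thread7_slow(long *frame) {
        long x = frame[2];           /* reload the value from its frame slot */
        last_thread = 7;
        return x * 2;
    }

    /* fast entry: valid only if thread 6 really did run immediately before,
     * as the compile-time "most likely" schedule assumed */
    static long thread7_fast(long x_in_reg, long *frame) {
        if (last_thread != 6)        /* schedule diverged at run-time */
            return thread7_slow(frame);
        last_thread = 7;
        return x_in_reg * 2;         /* the value stayed in a register */
    }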

10.3 Conclusion

We have come a long way with the completion of the first pH compiler. With it, we have a solid foundation for applications development in pH and for future research into new compilation strategies and optimizations. Nevertheless, we still have significant work ahead until pH reaches the level of efficiency that will make it compelling to a wide audience.

Bibliography

[1] S. Aditya, Arvind, J.-W. Maessen, L. Augustsson, and R. S. Nikhil. Semantics of pH: A parallel dialect of Haskell. CSG Memo 377-1, MIT Laboratory for Computer Science, June 1995.
[2] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. In Proceedings of the International Conference on Supercomputing, 1990.
[3] A. W. Appel. Garbage Collection can be Faster than Stack Allocation. Information Processing Letters, 25(4), January 1987.
[4] A. W. Appel. Modern Compiler Implementation in C. Cambridge University Press, 1998.
[5] A. W. Appel and Z. Shao. An Empirical and Analytic Study of Stack vs. Heap Cost for Languages with Closures. Technical Report CS-TR-450-94, Department of Computer Science, Princeton University, March 1994.
[6] Arvind, A. Caro, J.-W. Maessen, and S. Aditya. A Multithreaded Substrate and Compilation Model for the Implicitly Parallel Language pH. CSG Memo 382, MIT Laboratory for Computer Science, May 1996.
[7] Arvind, J.-W. Maessen, R. S. Nikhil, and J. E. Stoy. λS: An Implicitly Parallel λ-calculus with Recursive Bindings, Synchronization and Side Effects. Electronic Notes in Theoretical Computer Science, 16(3), 1998. URL: http://www.elsevier.nl/locate/entcs/volume16.html.
[8] Arvind and R. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. In Proceedings of the PARLE Conference (Lecture Notes in Computer Science 259). Springer-Verlag, June 1987.
[9] Arvind, R. S. Nikhil, and K. Pingali. I-Structures: Data Structures for Parallel Computing. In Proceedings of the Graph Reduction Workshop, Santa Fe, NM, October 1986, number 279 in Lecture Notes in Computer Science. Springer-Verlag, 1987.
[10] P. Barth, R. S. Nikhil, and Arvind. M-Structures: Extending a Parallel, Non-strict, Functional Language with State. In J. Hughes, editor, Proceedings of the 5th ACM Conference on Functional Programming Languages and Computer Architecture. ACM, Springer-Verlag, Aug. 1991.
[11] J. F. Bartlett. Scheme->C: A Portable Scheme-to-C Compiler. Technical Report DEC-WRL-89-1, Digital Equipment Corporation, Western Research Laboratory, 1989.
[12] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Proceedings of the 5th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP '95), Santa Barbara, California, July 1995.
[13] U. Boquist. Interprocedural Register Allocation for Lazy Functional Languages. In Proceedings of the Conference on Functional Programming Languages and Computer Architecture, La Jolla, California, June 1995.
[14] L. Cardelli. Typeful Programming. Research Report 45, Digital Equipment Corporation, Systems Research Center, May 1989.
[15] S. R. Coorg. Partitioning Non-strict Languages for Multi-threaded Code Generation. In Proceedings of the Static Analysis Symposium, Glasgow, Scotland, UK, September 1995.

[16] D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von Eicken. TAM - A Compiler Controlled Threaded Abstract Machine. Journal of Parallel and Distributed Computing, 18(3), July 1993.
[17] L. Damas and R. Milner. Principal Type-Schemes for Functional Programs. In Proc. 9th Symp. on Principles of Programming Languages, pages 207-212. ACM, January 1982.
[18] C. Flanagan and R. S. Nikhil. pHluid: The Design of a Parallel Functional Language Implementation on Workstations. In Proceedings of ICFP, 1996.
[19] M. Frigo. The Weakest Reasonable Memory Model. Master's thesis, MIT, October 1997.
[20] S. C. Goldstein. The Implementation of a Threaded Abstract Machine. Report UCB/CSD 94-818, Computer Science Division (EECS), University of California, Berkeley, May 1994.
[21] S. C. Goldstein, K. E. Schauser, and D. E. Culler. Lazy Threads: Implementing a Fast Parallel Call. Journal of Parallel and Distributed Computing, 37(1), August 1996.
[22] C. Hall, S. L. Peyton Jones, and P. M. Sansom. Unboxing using Specialisation. In Glasgow Functional Programming Workshop. Springer-Verlag, July 1994.
[23] K. Hammond. Parallel Functional Programming: An Introduction. In PASCO '94: First International Symposium on Parallel Symbolic Computation. World Scientific Publishing Company, September 1994.
[24] F. Henderson, Z. Somogyi, and T. Conway. Compiling Logic Programs to C Using GNU C as a Portable Assembler. In ILPS '95 Postconference Workshop on Sequential Implementation Technologies for Programming Languages, December 1995.
[25] F. Henglein and J. Jorgensen. Formally Optimal Unboxing. In Proceedings of the Symposium on Principles of Programming Languages, Portland, Oregon, January 1994.
[26] J. Hicks, D. Chiou, B. S. Ang, and Arvind. Performance Studies of Id on the Monsoon Dataflow System. Journal of Parallel and Distributed Computing, 18(3):273-300, July 1993.
[27] IBM. AIX Version 4.1 Assembler Language Reference, fourth edition, October 1995.
[28] T. Johnsson. Lambda-lifting: Transforming Programs to Recursive Equations. In Proceedings of Functional Programming Languages and Computer Architecture (FPCA), number 201 in Lecture Notes in Computer Science. Springer-Verlag, September 1985.
[29] T. Johnsson. Analysing Heap Contents in a Graph Reduction Intermediate Language. In S. L. Peyton Jones, G. Hutton, and C. Holst, editors, Glasgow Functional Programming Workshop, Workshops in Computing, Ullapool, Scotland, 1990. Springer-Verlag.
[30] R. Jones and R. Lins. Garbage Collection. John Wiley & Sons, 1996.
[31] S. P. Jones, A. Gordon, and S. Finne. Concurrent Haskell. In Proceedings of the 23rd ACM Symposium on Principles of Programming Languages (POPL '96), St. Petersburg Beach, Florida, January 1996.
[32] D. Kranz, R. Kelsey, J. Rees, P. Hudak, J. Philbin, and N. Adams. ORBIT: An Optimizing Compiler for Scheme. In SIGPLAN Notices (Proceedings of the SIGPLAN '86 Symposium on Compiler Construction), July 1986.
[33] D. Kranz, B.-H. Lim, A. Agarwal, and D. Yeung. Multithreaded Computer Architectures: A Summary of the State of the Art, chapter Low-Cost Support for Fine-Grain Synchronization in Multiprocessors. Kluwer Academic Publishers, 1994.
[34] B. C. Kuszmaul. Synchronized MIMD Computing. PhD thesis, MIT, May 1994.
[35] L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, 28(9):690-691, September 1979.
[36] X. Leroy. Efficient Data Representation in Polymorphic Languages. Rapports de Recherche 1264, INRIA-Rocquencourt, August 1990.

[37] H. W. Loidl. Granularity in Large-Scale Parallel Functional Programming. PhD thesis, University of Glasgow, March 1998.
[38] J.-W. Maessen. Eliminating Intermediate Lists in pH Using Local Transformations. Master's thesis, MIT, May 1994.
[39] J. S. Miller and G. J. Rozas. Garbage Collection is Fast, but a Stack is Faster. Technical Report AIM-1462, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1994.
[40] R. Milner, M. Tofte, R. Harper, and D. MacQueen. The Definition of Standard ML. MIT Press, 1997.
[41] R. Nikhil. Id Reference Manual, Version 90.1. CSG Memo 284-2, MIT Laboratory for Computer Science, July 1991.
[42] R. Nikhil, Arvind, J. Hicks, S. Aditya, L. Augustsson, J. Maessen, and Y. Zhou. pH Language Reference Manual 1.0. CSG Memo 369, MIT Laboratory for Computer Science, January 1995.
[43] R. S. Nikhil. A Multithreaded Implementation of Id using P-RISC Graphs. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, number 768 in Lecture Notes in Computer Science, Portland, OR, August 1993. Springer-Verlag.
[44] R. S. Nikhil and Arvind. Programming in pH: A Parallel Programming Language. Draft, 1998.
[45] G. Papadopoulos and D. E. Culler. Monsoon: An Explicit Token Store Architecture. In Proceedings of the 17th International Symposium on Computer Architecture, May 1990.
[46] J. Peterson and K. Hammond (editors). Report on the Programming Language Haskell, A Non-strict Purely Functional Language (Version 1.4). Department of Computer Science Tech Report YALEU/DCS/RR-1106, Yale University, February 1997.
[47] S. Peyton Jones and W. Partain. Measuring the Effectiveness of a Simple Strictness Analyser. In K. Hammond and J. O'Donnell, editors, Proceedings of the Glasgow Workshop on Functional Programming, Workshops in Computing. Springer-Verlag, 1993.
[48] S. L. Peyton Jones. The Implementation of Functional Programming Languages. Prentice-Hall, 1987.
[49] S. L. Peyton Jones. Implementing Lazy Functional Languages on Stock Hardware: the Spineless Tagless G-machine. Journal of Functional Programming, 2(2), April 1992.
[50] S. L. Peyton Jones and J. Launchbury. Unboxed Values as First Class Citizens in a Non-strict Functional Language. In Proceedings of the 1991 Conference on Functional Programming Languages and Computer Architecture, Cambridge, MA, September 1991.
[51] S. L. Peyton Jones, T. Nordin, and D. Oliva. C--: A Portable Assembly Language. In C. Clack, K. Hammond, and T. Davie, editors, Proceedings of the 1997 Workshop on Implementing Functional Languages, St. Andrews, Scotland, UK, volume 1467 of Lecture Notes in Computer Science. Springer-Verlag, August 1998.
[52] J. Rees and W. Clinger. The Revised4 Report on the Algorithmic Language Scheme. ACM Lisp Pointers, 4(3), 1991.
[53] M. C. Rinard. The Design, Implementation and Evaluation of Jade: A Portable, Implicitly Parallel Programming Language. PhD thesis, Stanford University, September 1994.
[54] V. Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. Pitman, London, and The MIT Press, Cambridge, Massachusetts, 1989. In the series Research Monographs in Parallel and Distributed Computing.
[55] K. E. Schauser. Compiling Lenient Languages for Parallel Asynchronous Execution. PhD thesis, Computer Science Division (EECS), University of California, Berkeley, 1994.

[56] K. E. Schauser, D. E. Culler, and T. von Eicken. Compiler-controlled Multithreading for Lenient Parallel Languages. In Proceedings of the Conference on Functional Programming Languages and Computer Architecture, volume 523 of Lecture Notes in Computer Science. Springer-Verlag, August 1991.
[57] K. E. Schauser and S. C. Goldstein. How Much Non-strictness do Lenient Programs Require? In Functional Programming and Computer Architecture, San Diego, CA, June 1995.
[58] Z. Shao and A. W. Appel. Space-Efficient Closure Representation. In Proceedings of the ACM Conference on Lisp and Functional Programming, June 1994.
[59] A. Shaw. Compiling for Parallel Multithreaded Computation on Symmetric Multiprocessors. PhD thesis, MIT, October 1997.
[60] A. Shaw, Arvind, K.-C. Cho, C. Hill, R. P. Johnson, and J. Marshall. A Comparison of Implicitly Parallel Multithreaded and Data-Parallel Implementations of an Ocean Model. Journal of Parallel and Distributed Computing, 48(1):1-51, January 1998.
[61] O. Shivers. Control-Flow Analysis of Higher-Order Languages. PhD thesis, School of Computer Science, Carnegie Mellon University, May 1991.
[62] G. L. Steele. RABBIT: a Compiler for Scheme. Technical Report MIT/AI/TR 474, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, May 1978.
[63] K. M. Steele. Implementation of an I-Structure Memory Controller. Technical Report LCS-TR-471, Laboratory for Computer Science, MIT, March 1990.

[64] G. L. Steele Jr. Common Lisp: The Language. Digital Press, 2nd edition, 1990.
[65] X. Tang, J. Wang, K. B. Theobald, and G. R. Gao. Thread Partitioning and Scheduling Based On Cost Model. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '97), Newport, Rhode Island, June 1997.
[66] D. Tarditi. Design and Implementation of Code Optimizations for a Type-Directed Compiler for Standard ML. PhD thesis, School of Computer Science, Carnegie Mellon University, December 1996.
[67] D. Tarditi, P. Lee, and A. Acharya. No Assembly Required: Compiling Standard ML to C. ACM Letters on Programming Languages and Systems, 1(2), June 1992.
[68] K. R. Traub. A Compiler for the MIT Tagged-Token Dataflow Architecture. Technical Report 370, MIT Laboratory for Computer Science, August 1986.
[69] K. R. Traub. Implementation of Non-Strict Functional Programming Languages. Research Monographs in Parallel and Distributed Computing. MIT Press, 1991.
[70] K. R. Traub, D. E. Culler, and K. E. Schauser. Global Analysis for Partitioning Non-Strict Programs into Sequential Threads. In Proceedings of the ACM Conference on Lisp and Functional Programming, San Francisco, CA, June 1992.
[71] G. Tremblay and G. R. Gao. The Impact of Laziness on Parallelism and the Limits of Strictness Analysis. In A. P. W. Bohm and J. T. Feo, editors, High Performance Functional Computing, April 1995.
[72] P. Trinder, K. Hammond, J. S. Mattson Jr., and A. Partridge. GUM: a Portable Parallel Implementation of Haskell. In Proceedings of Programming Language Design and Implementation, Philadelphia, USA, May 1996.
[73] F. Turbak. First-Class Synchronization Barriers. In Proceedings of the 1996 ACM SIGPLAN International Conference on Functional Programming, 1996.

Appendix A

Semantics of λS

This appendix¹ summarizes the semantics of the λS-calculus, the basis for λ*, using a Term Rewriting System. For additional details, refer to [7]. It might seem odd to give a semantics for λS rather than λ*, but there are two good reasons for this. First, semantics research in pH has been conducted using λS, so it is a tried and true formalism. Second, the differences between λ* and λS are very few. Specifically, they are:

* Lambda abstractions and function applications in λ* allow multiple arguments. Nevertheless, these can be transformed into nested, single-argument definitions or chains of single-argument applications in λS with no change in semantics (a sketch of this transformation appears below).

* The '~' statement conjunction is used to mark threads in λ* but does not exist in λS. For the purposes of semantics, '~' can be treated like ';'.

* λ* uses a more general Case expression for conditionals than the Cond in λS. Still, the former can be rewritten as a nested version of the latter with no change in semantics.

* λ* uses separate iStore and mStore statements, while λS uses a single Store. The additional detail in λ* will enable future M-Structure optimizations, though currently both statements are implemented identically.

* λ* uses iFetch and mFetch expressions instead of the Fetch and Take primitives. For the purposes of semantics, this choice makes no difference.

* λS includes syntax for dynamic terms, like heap and empty(Oj), created during evaluation. These are not present in λ* because it is designed to be a static program representation.

¹ Thanks to Jan-Willem Maessen for his help in preparing this material.
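As an illustration of the first difference above, the following is a minimal Haskell sketch of the currying transformation. The datatypes LStar and LS are toy fragments of our own invention, not representations from the thesis; they capture only variables, abstraction, and application.

    -- Toy fragments: multi-argument lambda* terms versus
    -- single-argument lambda_S terms (hypothetical names).
    data LStar = VarS String | LamN [String] LStar | AppN LStar [LStar]

    data LS = VarK String | Lam1 String LS | App1 LS LS

    -- A multi-argument abstraction becomes nested single-argument
    -- abstractions; a multi-argument application becomes a chain of
    -- single-argument applications.  Semantics are unchanged.
    curryTerm :: LStar -> LS
    curryTerm (VarS x)    = VarK x
    curryTerm (LamN xs b) = foldr Lam1 (curryTerm b) xs
    curryTerm (AppN f as) = foldl App1 (curryTerm f) (map curryTerm as)

For example, curryTerm (AppN f [e1, e2]) yields the chain App1 (App1 f' e1') e2', exactly the nested single-argument form the first bullet describes.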

A.1 Syntax for λS

We begin by presenting the syntax of λS in several parts. It will become evident that this syntax is more general than that of λ*. However, after applying the kernelization rules (shown later), the two become very similar. First, we give the Expressions, Primitive Functions, and Constants:

    E ::= x                          Identifier
        | E E                        Function application
        | { S in E }                 Letrec block
        | λx.E                       λ-abstraction
        | Cond(E, E, E)              Conditional
        | PFk(E1, ..., Ek)           k-ary primitive
        | CN0                        Constant
        | CNk(E1, ..., Ek)           Constructor application
        | CNk(x1, ..., xk)           Constructor data structure
        | allocate()                 Allocation of heap cell
        | Oj                         Heap cell

    PF1 ::= negate | not | Proj1 | Proj2 | ... | Fetch | Take      Unary primitives
    PF2 ::= ...                                                    Binary primitives
    CN0 ::= Number | Boolean | ...

The dynamic expression CNk(x1, ..., xk) represents the data structure that results from evaluating the constructor application CNk(E1, ..., Ek). For simplicity, the fields of such data structures, which follow I-Structure semantics, are not represented as individual memory cells. The dynamic expression Oj, on the other hand, does represent an individual I- or M-Structure memory cell. Next, we present the Statements:

    S ::= ε                          Empty statement
        | x = E                      Identifier binding
        | S ; S                      Parallel statements
        | S --- S                    Sequential statements
        | Store(E, E)                I- and M-Structure store
        | heap                       The global heap
        | full(Oj, V)                Full heap cell
        | empty(Oj)                  Empty heap cell
        | error(Oj)                  Dynamic error

The full(Oj, V), empty(Oj), and error(Oj) statement syntax is used to represent the presence state of I- and M-Structure cells.
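To make the grammar concrete, here is a minimal sketch of how it might be rendered as Haskell datatypes. The constructor names (Var, Letrec, Par, and so on) are our own hypothetical encoding, not names from the thesis, and identifiers and heap-cell names are kept deliberately simple.

    -- A hypothetical Haskell rendering of the lambda_S grammar.
    type Ident = String   -- identifiers x
    type ObjId = Int      -- heap cell names O_j

    data Expr
      = Var Ident                 -- x
      | App Expr Expr             -- E E
      | Letrec Stmt Expr          -- { S in E }
      | Lam Ident Expr            -- λx.E
      | Cond Expr Expr Expr       -- Cond(E, E, E)
      | Prim String [Expr]        -- PFk(E1, ..., Ek)
      | ConstNum Integer          -- CN0 (numbers)
      | ConstBool Bool            -- CN0 (booleans)
      | ConApp String [Expr]      -- CNk(E1, ..., Ek)
      | ConData String [Ident]    -- CNk(x1, ..., xk), dynamic
      | Allocate                  -- allocate()
      | Cell ObjId                -- Oj, dynamic
      deriving (Eq, Show)

    data Stmt
      = Empty                     -- empty statement
      | Bind Ident Expr           -- x = E
      | Par Stmt Stmt             -- S ; S   (parallel)
      | Seq Stmt Stmt             -- S --- S (barrier)
      | Store Expr Expr           -- Store(E, E)
      | Heap                      -- the global heap
      | Full ObjId Expr           -- full(Oj, V)
      | EmptyCell ObjId           -- empty(Oj)
      | Error ObjId               -- error(Oj)
      deriving (Eq, Show)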

To simplify the reduction rules, it is useful to classify some of the syntactic constructs into special subsets. The first of these subsets is V, the syntax for Values. Values are expressions that constitute the valid results of programs. Put another way, a Value is not going to change observably if more evaluation rules are applied to it, so it can be considered to have terminated.

    V ::= λx.E | CN0 | CNk(x1, ..., xk) | Oj

Next is the subset SE of Simple Expressions. These consist of an identifier or a Value.

    SE ::= x | V

The syntactic subset H represents statements that have terminated. This is an important category when dealing with barriers because terminated statements are those that can be removed from a barrier pre-region. As we will see, once all terminated statements are removed, the barrier can discharge. Notice that all expressions used here are Values.

    H ::= x = V | H ; H | heap | full(Oj, V) | empty(Oj) | error(Oj)

The final subset is ET, for Terminated Expressions. These are easy to understand: they consist of a Value or a letrec wherein all the statements and the in expression have terminated.

    ET ::= V | { H in SE }
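Continuing the hypothetical Haskell encoding from above (the sketches together form one module), the subsets V, SE, H, and ET become simple predicates. This is our own sketch, not code from the thesis.

    -- Values: results that no further reduction changes observably.
    isValue :: Expr -> Bool
    isValue (Lam _ _)     = True
    isValue (ConstNum _)  = True
    isValue (ConstBool _) = True
    isValue (ConData _ _) = True
    isValue (Cell _)      = True
    isValue _             = False

    -- Simple expressions (SE): an identifier or a value.
    isSimple :: Expr -> Bool
    isSimple (Var _) = True
    isSimple e       = isValue e

    -- Terminated statements (subset H).
    isDoneStmt :: Stmt -> Bool
    isDoneStmt (Bind _ v)    = isValue v
    isDoneStmt (Par s1 s2)   = isDoneStmt s1 && isDoneStmt s2
    isDoneStmt Heap          = True
    isDoneStmt (Full _ v)    = isValue v
    isDoneStmt (EmptyCell _) = True
    isDoneStmt (Error _)     = True
    isDoneStmt _             = False

    -- Terminated expressions (subset ET).
    isDoneExpr :: Expr -> Bool
    isDoneExpr (Letrec s e) = isDoneStmt s && isSimple e
    isDoneExpr e            = isValue e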

A.2 Contexts in λS

A context is a device used to identify a portion of program syntax: the portion that is to be modified by the reduction rules. For instance, the context C[x] for the expression:

    Cond(p, x + w, y * z)

identifies the occurrence of x in the first branch of the conditional. There are two kinds of contexts: C is an expression context and SC is a statement context.

    C[ ] ::= [ ]
           | C[ ] E | E C[ ]
           | λx.C[ ]
           | { SC[ ] in E } | { S in C[ ] }
           | Cond(C[ ], E, E) | Cond(E, C[ ], E) | Cond(E, E, C[ ])
           | PFk(C[ ], ..., Ek) | ... | PFk(E1, ..., C[ ])
           | CNk(C[ ], ..., Ek) | ... | CNk(E1, ..., C[ ])

    SC[ ] ::= [ ] | x = C[ ] | SC[ ] ; S | SC[ ] --- S | S --- SC[ ] | Store(C[ ], E) | Store(E, C[ ])

The definition of these contexts is not particularly clever. They can be used to pick out any arbitrary expression within any other arbitrary expression. But this is exactly what is needed by the reduction rules, since, in general, they are applicable anywhere a valid redex exists.
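One simple way to think about a context, under the hypothetical Haskell encoding above, is as a function from an expression (the hole's content) to an expression. The sketch below, again our own, renders the conditional example from this section.

    -- A one-hole expression context, modeled as a function.
    type Ctx = Expr -> Expr

    -- The context C[ ] with its hole in the first branch of
    -- Cond(p, [ ] + w, y * z), as in the example above.
    firstBranch :: Ctx
    firstBranch hole =
      Cond (Var "p")
           (Prim "+" [hole, Var "w"])
           (Prim "*" [Var "y", Var "z"])

    -- firstBranch (Var "x")    ==> Cond(p, x + w, y * z)
    -- firstBranch (ConstNum 5) ==> Cond(p, 5 + w, y * z),
    -- i.e., the rewriting an instantiation rule would perform there.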

A.3 Equivalence Rules for λS

In this section, we present three rules that can be used to manipulate expressions without changing their meaning. The first of these is the standard α-renaming, which is well-known from the λ-calculus. It allows bound identifiers to be renamed freely, since the name of an identifier, intrinsically, contributes nothing to the meaning of a term. In the rules, the syntax "e[t/x]" is read as "substitute t for all free instances of x in expression e."

    λx.e ≡ λt.(e[t/x])
    { SC[x = e]α in e0 } ≡ { SC[t = e]α in e0 }[t/x]

where statement contexts for renamed bindings are defined as follows:

    SC[ ]α ::= [ ] | SC[ ]α ; S | SC[ ]α --- S | S --- SC[ ]α

The second rule shows how the parallel statement conjunction can commute and distribute across parentheses.

    S1 ; S2 ≡ S2 ; S1
    S1 ; (S2 ; S3) ≡ (S1 ; S2) ; S3

The final rule deals with barriers and is more subtle than the previous two. It shows how barriers can be distributed in the face of terminated statements (from syntax subset H). This amounts to pulling terminated statements out from the pre-region of a barrier.

    (H ; S1) --- S2 ≡ H ; (S1 --- S2)

For example, since the binding x = 5 belongs to H, the term (x = 5 ; S1) --- S2 can be rewritten to x = 5 ; (S1 --- S2), so the barrier now waits only on S1.

A.4 Reduction Rules for λS

At last, we reach the reduction rules for λS, shown in Figure A-1. These rules tell us how to proceed from an initial term in λS to a final result, a term in the syntactic subset ET.

To simplify the rules, we assume that all bound variables in the initial term are unique. The rules work as follows: if the term matches the left-hand side of a rule, then the term can be replaced with the right-hand side of that rule, and evaluation continues. If a term matches several rules, then an arbitrarily chosen matching rule can be applied. Some points about notation: in all the rules, a represents a Simple Expression; the notation [x] represents a free occurrence of identifier x in context C[x] or SC[x]; t is a completely fresh identifier distinct from all others in the program; and, e' and S' are versions of e and S wherein identifiers have been renamed to avoid name conflicts. In addition, notice that the rules are divided into two types: kernel rules and dynamic rules. For the present, this division is not important, but we return to it in the next section.
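Before examining the rules one by one, a small worked example of ours (informally eliding the heap and empty statements) shows how they chain together. Starting from the term (λx. x + 1) 2:

    (λx. x + 1) 2
    → { t = 2 in t + 1 }      βlet
    → { t = 2 in 2 + 1 }      Inst1
    → { t = 2 in 3 }          δ

The final term belongs to ET: the binding t = 2 lies in H, and 3 is a Simple Expression, so evaluation is complete.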

βlet  This first rule models function application. A new letrec scope is created. Within that new scope, the argument is bound to a fresh variable t, and then t is substituted for all free occurrences of x in the body of the function, now the in expression of the letrec.

Inst1-Inst3 The next three rules are different versions of instantiation. To instantiate a bound variable means to replace it with its value in the context of an expression. Inst1 instantiates within the in expression of a letrec, and Inst2 within statements of a letrec.

Inst3 is an optional rule because it is not a fundamental part of the semantics but rather is used to express certain optimizations like loop unrolling: it allows a value to be instantiated within its own definition.

    Rule                                                                        Name : Type
    ---------------------------------------------------------------------------------------
    (λx.e1) e2                  → { t = e2 in e1[t/x] }                         βlet : D
    { x = a ; S in C[x] }       → { x = a ; S in C[a] }                         Inst1 : D
    x = a ; SC[x]               → x = a ; SC[a]                                 Inst2 : D*
    x = C[x]                    → x = C[C[x]]                                   Inst3 : (optional)
    x = { S in e }              → x = e' ; S'                                   Flat1 : D
    { ε in e }                  → e                                             Flat2 : D
    e                           → { t = e in t }    where e ∉ SE                Name : K
    { S1 in { S2 in e } }       → { S1 ; S2 in e }                              LiftB : (optional)
    ({ S in e } e2)             → { S in e e2 }                                 LiftAp1 : K
    (e1 { S in e })             → { S in e1 e }                                 LiftAp2 : K
    Cond({ S in e }, e1, e2)    → { S in Cond(e, e1, e2) }                      LiftCond : K
    PFk(e1, ..., { S in e }, ..., ek) → { S in PFk(e1, ..., e, ..., ek) }       LiftPF : K
    Store({ S in e }, e2)       → S ; Store(e, e2)                              LiftM1 : K
    Store(e1, { S in e })       → S ; Store(e1, e)                              LiftM2 : K
    CNk(e1, ..., ek)            → { t1 = e1 ; ... ; tk = ek in CNk(t1, ..., tk) }  Cons : K
    Projj(CNk(x1, ..., xk))     → xj,  1 ≤ j ≤ k                                Proj : D
    Cond(True, e1, e2)          → e1                                            CondT : D
    Cond(False, e1, e2)         → e2                                            CondF : D
    PFk(v1, ..., vk)            → pfk(v1, ..., vk)                              δ : D
    ε --- S                     → S                                             Bar1 : D
    S --- ε                     → S                                             Bar2 : (optional)
    heap ; x = allocate()       → heap ; empty(Oj) ; x = Oj                     Alloc : D
                                  where Oj is a new object identifier
    empty(Oj) ; Store(Oj, v)    → full(Oj, v)                                   Store : D
    full(Oj, v) ; Store(Oj, w)  → error(Oj)                                     Error : D
    full(Oj, v) ; x = Fetch(Oj) → full(Oj, v) ; x = v                           IFetch : D
    full(Oj, v) ; x = Take(Oj)  → empty(Oj) ; x = v                             MFetch : D
    e                           → { heap in e }                                 HeapInit : K

    * a is restricted to values when the rule is used dynamically

Figure A-1: Reduction Rules for λS (K = kernel rule, D = dynamic rule)

Flat1, Flat2  The next two rules flatten a program by removing extra letrec scopes that might be created during evaluation, e.g., by the βlet rule. Notice that when scopes are merged in Flat1, renaming must be done to prevent name clashes.

Name This rule introduces a new scope and binds an identifier to an otherwise unnamed expression.

Lift Rules  The effect of these rules is to pull nested statements out of sub-expressions and into a common scope with the enclosing expression. The LiftB rule is optional because it is only needed as a convenience for top-level definitions. Definitions at deeper levels are handled by other rules.

Cons This rule "creates" an algebraic data structure from its specification. It generates a new scope where the expressions for the fields are named, and thus allows evaluation of these to proceed in parallel with the use of the data structure itself.

Proj Given a data structure, this projection rule yields the variable that identifies the computation for a selected field of the data structure.

CondT, CondF These rules evaluate conditionals. If the predicate is True, the term is rewritten to the first alternative. If the predicate is False, the term is rewritten to the second alternative.

δ  This rule actually represents many rules for primitive functions. It simply states that when all the arguments to a primitive function are evaluated (are values), then the term can be replaced with computation of the primitive function on those values.

Bar1, Bar2 These rules show how to discharge barriers. In the first rule, the barrier discharges, as expected, when the pre-region is empty. In the second rule, the barrier discharges because the post-region is trivial (empty). This is an optional optimization rule.

Alloc  This rule models allocation of individual memory cells. Given the heap and a binding with an allocate expression, it converts the term into the heap, an empty new cell, and a binding for the new object identifier.

Store, Error These two rules show the effects of stores on I- and M-Structure cells. A correct Store is one to an empty cell. The result is a full cell. A Store to a full cell, on the other hand, leads to an error.

IFetch, MFetch  The IFetch rule models an I-Structure access. Given a full cell, the value of the cell is extracted and bound to an identifier, and the cell remains full. In contrast, the M-Structure access (MFetch) changes the state of the cell to empty.

HeapInit This rule is used to add a single heap to the initial term of the program.
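To connect the rules to something executable, the following is a hedged sketch, continuing the hypothetical Haskell encoding from Section A.1, of a one-step reducer covering four of the dynamic rules: CondT, CondF, one δ instance (addition), and βlet. Renaming, contexts, the heap rules, and the remaining primitives are deliberately omitted.

    -- Attempt one reduction at the root of a term.  The Int supplies a
    -- globally fresh name for beta_let; Nothing means none of these
    -- four rules applies here.
    step :: Int -> Expr -> Maybe Expr
    step fresh e = case e of
      Cond (ConstBool True)  e1 _       -> Just e1                  -- CondT
      Cond (ConstBool False) _  e2      -> Just e2                  -- CondF
      Prim "+" [ConstNum a, ConstNum b] -> Just (ConstNum (a + b))  -- delta
      App (Lam x body) arg ->                                       -- beta_let
        let t = "t" ++ show fresh
        in Just (Letrec (Bind t arg) (subst x t body))
      _ -> Nothing

    -- Substitute identifier t for free occurrences of x.  Because the
    -- thesis assumes all bound variables in the initial term are
    -- unique, no shadowing or capture can occur, so the traversal
    -- can be naive.
    subst :: Ident -> Ident -> Expr -> Expr
    subst x t = go
      where
        go (Var y)        = Var (if y == x then t else y)
        go (App f a)      = App (go f) (go a)
        go (Lam y b)      = Lam y (go b)
        go (Cond p a b)   = Cond (go p) (go a) (go b)
        go (Prim op es)   = Prim op (map go es)
        go (ConApp c es)  = ConApp c (map go es)
        go (ConData c xs) = ConData c [if y == x then t else y | y <- xs]
        go (Letrec s b)   = Letrec (goS s) (go b)
        go other          = other

        goS (Bind y e')   = Bind y (go e')
        goS (Par a b)     = Par (goS a) (goS b)
        goS (Seq a b)     = Seq (goS a) (goS b)
        goS (Store a b)   = Store (go a) (go b)
        goS (Full o v)    = Full o (go v)
        goS other         = other

For example, step 0 (App (Lam "x" (Prim "+" [Var "x", ConstNum 1])) (ConstNum 2)) yields the Letrec term corresponding to { t0 = 2 in t0 + 1 }, matching the βlet step in the worked example above.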

A.5 An Efficient Reduction Strategy

The syntax and contexts given so far are too permissive because they allow rules to be applied anywhere. This can lead evaluation into unproductive reductions.

This section shows how to apply the rules given in the previous section efficiently by first kernelizing the input term and by using a restricted set of contexts.

As we noted, the reduction rules are of two types: kernel rules and dynamic rules.

The process of kernelizing a program is accomplished by applying only the kernel rules, repeatedly, to the initial term until no more can be applied. After this pre-processing step, the dynamic rules can be applied, but now to a term that follows the much simpler syntax shown below:

    E ::= x | λx.E | SE SE | { S in E } | Cond(SE, E, E) | PFk(SE1, ..., SEk)
        | CN0 | CNk(x1, ..., xk) | allocate() | Oj

    S ::= x = E | S ; S | S --- S | Store(SE, SE) | full(Oj, V) | empty(Oj) | heap | error(Oj)
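As a small example of kernelization (ours; one of several possible rule orders), consider the application f (g x). The argument g x is not a Simple Expression, so the Name rule introduces a binding for it, and LiftAp2 then hoists the resulting letrec out of the application:

    f (g x)
    → f { t = g x in t }        Name
    → { t = g x in f t }        LiftAp2

The result fits the kernel syntax above: the application f t now has the form SE SE.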

In conjunction with the kernelization process, the efficient strategy also requires more restricted contexts to avoid reductions inside the bodies of λ-abstractions that have not been applied. The contexts are divided into two groups with, somewhat confusingly, the same names. The first group, for Evaluation, dictates the contexts where a rule can be applied. In other words, the Evaluation Contexts identify what portions of the program can be targeted for rewriting by one of the reduction rules.

Restricted Evaluation Contexts

    C[ ] ::= [ ] | { SC[ ] in E } | { S in C[ ] }

    SC[ ] ::= [ ] | SC[ ] ; S | S ; SC[ ] | SC[ ] --- S

The second group of contexts are the Instantiation Contexts. These are the contexts used within the definitions of the three instantiation rules, Inst1-Inst3. They work in concert with the Evaluation Contexts as follows. First, an Evaluation Context is used to identify a portion of the whole term that matches the left-hand side of one of the instantiation rules. Then, within the body of the instantiation rule, the Instantiation Contexts are used to pick a sub-portion to rewrite. For example, in the rule Inst1, the "C[x]" refers to an Instantiation Context, not an Evaluation Context.

Restricted Instantiation Contexts

    C[ ] ::= [ ] | [ ] E | Cond([ ], E, E) | PFk([ ], ..., Ek) | ... | PFk(E1, ..., [ ])

    SC[ ] ::= x = C[ ] | SC[ ] ; S | SC[ ] --- S | Store(C[ ], E) | Store(E, C[ ])
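As an illustration of this division of labor (our example, eliding the empty statement): in the term { x = 5 in x + 1 }, the empty Evaluation Context [ ] matches the whole letrec against the left-hand side of Inst1; within the rule, the Instantiation Context [ ] + 1, an instance of PFk([ ], ..., Ek), picks out the free occurrence of x, and the term rewrites to { x = 5 in 5 + 1 }.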
