UCAM-CL-TR-871
ISSN 1476-2986

Technical Report
Number 871

Computer Laboratory

Architecture-neutral parallelism via the Join Calculus

Peter R. Calvert

July 2015

15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500
http://www.cl.cam.ac.uk/

© 2015 Peter R. Calvert

This technical report is based on a dissertation submitted September 2014 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Trinity College.

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/

ISSN 1476-2986

Abstract

Ever since the UNCOL efforts in the 1960s, compilers have sought to use both source-language-neutral and architecture-neutral intermediate representations. The advent of web applets led to the JVM, where such a representation was also used for distribution. This trend has continued and now many mainstream applications are distributed using the JVM or .NET formats. These languages can be efficiently run on a target architecture (e.g. using JIT techniques). However, such intermediate languages have been predominantly sequential, supporting only rudimentary concurrency primitives such as threads. This thesis proposes a parallel intermediate representation with analogous goals.

The specific contributions made in this work are based around a join calculus abstract machine (JCAM). These can be broadly categorised into three sections.

The first contribution is the definition of the abstract machine itself. The standard join calculus is modified to prevent implicit sharing of data as this is undesirable in non-shared-memory architectures. It is then condensed into three primitive operations that can be targeted by optimisations and analyses. I claim that these three simple operations capture all the common styles of concurrency and parallelism used in current programming languages.

The work goes on to show how the JCAM intermediate representation can be implemented on shared-memory multi-core machines with acceptable overheads. This process illustrates some key program properties that can be exploited to give significant benefits in certain scenarios.

Finally, conventional control-flow analyses are adapted to the join calculus to allow the properties required for optimising compilation to be inferred. Along with the prototype compiler, this illustrates the JCAM's capabilities as a universal representation for parallelism.

Acknowledgements

This dissertation would never have been completed without the help and support of a number of people. My supervisor, Alan Mycroft, has allowed me the freedom to explore my own ideas, and been invaluable in teaching me how to present them.

I have been fortunate for the discussions that I've had with numerous members of the Computer Laboratory over the last four years, particularly with other members of the programming research group. There are too many to name, but particular thanks to Dominic, Robin, Jukka, Tomas, Raoul, Janina, Stephen and Raphael.

I must also thank all those who supervised me during my time as an undergraduate, especially Andy Rice, for their enthusiasm which encouraged me to pursue a PhD.

I am also grateful to the Schiff Foundation for their funding which enabled me to undertake this research, and to Trinity College for its support during my time at Cambridge.

Finally, I would like to thank my parents for their endless encouragement and advice.

Contents

1 Introduction 11
1.1 Research Context 12
1.1.1 Increased Development Productivity 12
1.1.2 Reduced Barrier to Research 13
1.1.3 Elegance 13
1.2 Outline and Contributions 14
1.3 General Notation 15
1.4 Publications 15

2 Technical Background 17
2.1 Computer Architecture 17
2.1.1 Moore's Law 17
2.1.2 The Power Wall 18
2.1.3 Multi-Core 18
2.1.4 Vector Processors 18
2.1.5 Heterogeneous Architectures 20
2.1.6 Moving away from Uniform Memory 20
2.1.7 Dataflow Architectures 21
2.2 Virtual Machines and Intermediate Representations 21
2.2.1 Conventional Approaches 22
2.2.2 Dataflow Approaches 23
2.2.3 Alternative Techniques 23
2.3 Programming Language Models of Parallelism 25
2.3.1 Threads and Tasks 25
2.3.2 CUDA and OpenCL 31
2.3.3 Nested Data Parallelism 32
2.3.4 Embedded Domain Specific Languages 33
2.3.5 Dataflow and Streaming Languages 33
2.3.6 Message Passing 34
2.4 The Join Calculus 36
2.4.1 Origins, Semantics and Related Models 37
2.4.2 JoCaml 39
2.4.3 The Joins Library 40
2.5 Control-Flow Analysis 43
2.5.1 Constraint-based 43
2.5.2 Abstract Interpretation 45
2.5.3 Call Strings 48
2.5.4 Escape-based Techniques 49
2.6 Summary 49

3 The Join Calculus Abstract Machine 51
3.1 Arriving at the Join Calculus 51
3.1.1 Supporting Non-Deterministic Choice 51
3.1.2 Exposing Fine-Grained Parallelism 52
3.1.3 Reintroducing Functions and Dynamic Behaviour 54
3.1.4 Granularity 57
3.2 Making Data Dependencies Explicit 58
3.3 The Join Calculus Abstract Machine 61
3.3.1 Semantics 63
3.3.2 Paths and Traces 65
3.4 Relation to Other Models 66
3.4.1 Coordination Zoo 67
3.4.2 Streaming Computations 69
3.4.3 Task-based Parallelism 70
3.4.4 Transactions 74
3.4.5 Objects and Actors 75
3.4.6 Complex Memories 76
3.5 Summary 81

4 Efficient Implementation 83
4.1 A Baseline Implementation 83
4.1.1 Specialised Emission Functions 84
4.1.2 Join Calculus Data Structures 85
4.1.3 Work Stealing for the Join Calculus 87
4.2 Profiling Bottlenecks 88
4.3 Closed Definitions 88
4.4 Bounded Queues 90
4.4.1 Annotations 92
4.4.2 Implementation 92
4.5 Inlining 93
4.5.1 Transition Inlining 94
4.5.2 Definition Inlining 95
4.6 Summary 96

5 Control Flow Analysis 97
5.1 Translating 0-CFA to the Join Calculus 97
5.2 Dealing with Message Interaction: 0-LCFA 99
5.3 Abstracting Call-DAGs: k-LCFA 102
5.4 Correctness of k-LCFA 107
5.4.1 Constraint Generation 108
5.4.2 Closure Algorithm 112
5.5 Queue Bounding 114
5.6 Closedness of Definitions 115
5.7 Worked Examples 115
5.7.1 Foreground call-strings example ('handshake') 116
5.7.2 Closedness example (fib) 118
5.8 Summary 120

6 Evaluation and Discussion 121
6.1 Test Environments 121
6.2 Benchmarks 122
6.2.1 fib 123
6.2.2 nqueens 123
6.2.3 quicksort 123
6.2.4 locks 123
6.2.5 barrier 123
6.2.6 rwlock 123
6.2.7 queue 124
6.2.8 blackscholes 124
6.3 Sequential Performance 124
6.4 Scaling to Multiple Cores 126
6.5 Accuracy of Analysis 131
6.6 Better Scheduling 132
6.6.1 The Scheduling Problem 132
6.6.2 Improvements on Basic Work Stealing 133
6.6.3 Transition Inlining 134
6.7 Summary 136

7 Conclusion 137
7.1 Compilation to Other Architectures 138
7.2 Further Analysis 138
7.3 Better Scheduling and Inlining 139
7.4 Final Remarks 139

Bibliography 149

Chapter 1

Introduction

The shift in processor design from ever-increasing clock speeds to multi- and many-core parallelism has been well-documented [109]. Shared-memory multicore systems are now ubiquitous; clusters of these commonplace; general-purpose GPUs mainstream; and the range of esoteric research designs ever-expanding. Indeed, techniques to exploit such architectures have been an especially hot research topic over the last decade. However, for the most part, these have considered a restricted set of problem domains or architectures. It has even been argued that building specialised compilers from each form of parallelism to each target is the only viable approach [27, 86].

Such a conclusion does not sit well in computer science, a discipline almost defined by the view that "all problems ... can be solved by another level of indirection" (David Wheeler). Nowhere is this more true than in the compiler community, which has used abstractions so effectively for sequential architectures. The Java Virtual Machine (JVM).
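To make the coordination model concrete before it is developed formally in Chapter 3, the join-pattern idea at the heart of the calculus can be pictured in a few lines: messages sent on channels are queued, and a transition fires only once every channel in its pattern holds a pending message. The sketch below is an illustrative toy in Python, not the JCAM instruction set nor the JoCaml or Joins Library APIs; the Join, channel and when names are invented purely for this example.

    # Toy join-pattern matching: a pattern fires only when each of its
    # channels has a pending message (cf. the join calculus, Section 2.4).
    from collections import deque

    class Join:
        def __init__(self):
            self.queues = {}      # channel name -> queue of pending messages
            self.patterns = []    # (tuple of channel names, transition body)

        def channel(self, name):
            self.queues[name] = deque()
            def send(value):
                self.queues[name].append(value)
                self._try_fire()
            return send

        def when(self, names, body):
            self.patterns.append((tuple(names), body))

        def _try_fire(self):
            # Fire at most one enabled pattern per send; the real calculus
            # chooses non-deterministically among enabled patterns.
            for names, body in self.patterns:
                if all(self.queues[n] for n in names):
                    body(*[self.queues[n].popleft() for n in names])
                    return

    # A rendezvous: a 'put' and a 'get' message are consumed together.
    j = Join()
    put = j.channel("put")
    get = j.channel("get")
    j.when(["put", "get"], lambda value, reply: reply(value))

    get(lambda v: print("received", v))   # consumer arrives first; nothing fires yet
    put(42)                               # both channels now pending -> prints "received 42"

Chapter 3 develops this informal picture into the abstract machine proper.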