Extending old languages for new architectures

Leo P. White
University of Cambridge, Computer Laboratory
Queens' College

October 2012

This dissertation is submitted for the degree of Doctor of Philosophy.

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the regulation length of 60 000 words, including tables and footnotes.

Summary

Architectures evolve quickly. The number of transistors available to chip designers doubles every 18 months, allowing increasingly complex architectures to be developed on a single chip. Power dissipation issues have forced chip designers to look for new ways to use the transistors at their disposal. This situation inevitably leads to new architectural features on a fairly regular basis.

Enabling programmers to benefit from these new architectural features can be problematic. Since architectures change frequently, and compilers last for a long time, it is clear that compilers should be designed to be extensible. This thesis argues that to support evolving architectures a compiler should support the creation of high-level language extensions. In particular, it must support extending the compiler's middle-end. We describe the design of EMCC, a C compiler that allows extension of its front-, middle- and back-ends.

OpenMP is an extension to the C programming language to support parallelism. It has recently added support for task-based parallelism, a dynamic form of parallelism made popular by Cilk. However, implementing task-based parallelism efficiently requires much more involved program transformation than the simple static parallelism originally supported by OpenMP. We use EMCC to create an implementation of OpenMP, with particular focus on the efficient implementation of task-based parallelism.

We also demonstrate the benefits of supporting high-level analysis through an extended middle-end by developing and implementing an interprocedural analysis that improves the performance of task-based parallelism by allowing tasks to share stacks. We develop a novel generalisation of logic programming that we use to express this analysis concisely, and use this formalism to demonstrate that the analysis can be executed in polynomial time.

Finally, we design extensions to OpenMP to support heterogeneous architectures.
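For readers unfamiliar with OpenMP tasks, the sketch below illustrates the dynamic, task-based style of parallelism discussed in the summary. It is a standard illustrative example (naive parallel Fibonacci), not code from the thesis, and uses only core constructs from the OpenMP specification (task, taskwait, single).

#include <stdio.h>

/* Naive parallel Fibonacci: each recursive call may become an
 * OpenMP task, so parallelism unfolds dynamically at run time
 * rather than being fixed statically at compile time.
 * Compile with: cc -fopenmp fib.c */
static int fib(int n)
{
    int x, y;
    if (n < 2)
        return n;

    #pragma omp task shared(x)
    x = fib(n - 1);

    #pragma omp task shared(y)
    y = fib(n - 2);

    /* Wait for both child tasks before combining their results. */
    #pragma omp taskwait
    return x + y;
}

int main(void)
{
    int result;
    #pragma omp parallel
    {
        /* One thread creates the root task; the rest of the team
         * helps execute the tasks it spawns. */
        #pragma omp single
        result = fib(30);
    }
    printf("fib(30) = %d\n", result);
    return 0;
}

Every suspended task needs somewhere to keep its local variables, so task creation carries a memory cost as well as a scheduling cost; allowing tasks to share stacks, as in the interprocedural analysis summarised above, is one way to reduce that cost.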
Acknowledgments

I would like to thank my supervisor Alan Mycroft for his invaluable help and advice. I would also like to thank Derek McAuley for helpful discussions. I also thank Netronome for funding the work. Special thanks go to Zoë for all her support.

Contents

1 Introduction
  1.1 Evolving architectures require extensible compilers
  1.2 Extensibility and compilers
    1.2.1 The front-, middle- and back-ends
    1.2.2 Declarative specification of languages
    1.2.3 Preprocessors and Macros
  1.3 Extending old languages for new architectures
  1.4 Compilers for multi-core architectures
  1.5 OpenMP: Extending C for multi-core
  1.6 Contributions
  1.7 Dissertation outline

2 Technical background
  2.1 The OCaml programming language
    2.1.1 The module system
    2.1.2 The class system
    2.1.3 Generalised Algebraic Datatypes
  2.2 Task-based parallelism
    2.2.1 Support for task-based parallelism
  2.3 OpenMP
  2.4 Models for parallel computations
    2.4.1 Parallel computations
    2.4.2 Execution schedules
    2.4.3 Execution time and space
    2.4.4 Scheduling algorithms and restrictions
    2.4.5 Optimal scheduling algorithms
    2.4.6 Efficient scheduling algorithms
    2.4.7 Inherently inefficient example
  2.5 Logic programming
    2.5.1 Logic programming
    2.5.2 Negation and its semantics
    2.5.3 Implication algebra programming

3 EMCC: An extensible C compiler
  3.1 Related Work
    3.1.1 Mainstream compilers
    3.1.2 Extensible compilers
    3.1.3 Extensible languages
  3.2 Design overview
  3.3 Patterns for extensibility
    3.3.1 Properties
    3.3.2 Visitors
  3.4 Extensible front-end
    3.4.1 Extensible syntax
    3.4.2 Extensible semantic analysis and translation
  3.5 Extensible middle-end
    3.5.1 Modular interfaces
    3.5.2 CIL: The default IL
  3.6 Conclusion

4 Run-time library
  4.1 Modular build system
  4.2 Library overview
    4.2.1 Thread teams
    4.2.2 Memory allocator
    4.2.3 Concurrency primitives
    4.2.4 Concurrent data structures
    4.2.5 Worksharing primitives
    4.2.6 Task primitives
  4.3 Atomics
  4.4 Conclusion

5 Efficient implementation of OpenMP
  5.1 Efficiency of task-based computations
    5.1.1 Execution model
    5.1.2 Memory model
    5.1.3 Scheduling tasks efficiently
    5.1.4 Scheduling OpenMP tasks efficiently
    5.1.5 Inefficiency of stack-based implementations
    5.1.6 Scheduling overheads
  5.2 Implementing OpenMP
    5.2.1 Efficient task-based parallelism
    5.2.2 Implementing OpenMP in EMCC
    5.2.3 Implementing OpenMP in our run-time library
  5.3 Evaluation
    5.3.1 Benchmarks
    5.3.2 Results
  5.4 Related work
  5.5 Conclusion

6 Optimising task-local memory allocation
  6.1 Model of OpenMP programs
    6.1.1 OpenMP programs
    6.1.2 Paths, synchronising instructions and the call graph
  6.2 Stack sizes
  6.3 Stack size analysis using implication programs
    6.3.1 Rules for functions
    6.3.2 Rules for instructions
    6.3.3 Optimising merged and unguarded sets
    6.3.4 Finding an optimal solution
    6.3.5 Adding context-sensitivity
  6.4 The analysis as a general implication program
    6.4.1 Stack size restrictions
    6.4.2 Restriction rules
    6.4.3 Other rules
    6.4.4 Extracting solutions
  6.5 Stratification
  6.6 Complexity of the analysis
  6.7 Implementation
  6.8 Evaluation
  6.9 Conclusion

7 Extensions to OpenMP for heterogeneous architectures
  7.1 Heterogeneous architectures and OpenMP
  7.2 Related work
  7.3 Design of the extensions
    7.3.1 Thread mapping and processors
    7.3.2 Subteams
    7.3.3 Syntax
    7.3.4 Examples
  7.4 Implementation
  7.5 Experiments
  7.6 Conclusion

8 Conclusion and future work
  8.1 Conclusion
  …