Guided Automatic Binary Parallelisation
Total Page:16
File Type:pdf, Size:1020Kb
Guided Automatic Binary Parallelisation Ruoyu (Kevin) Zhou St John's College This dissertation is submitted in July, 2017 for the degree of Doctor of Philosophy in Computer Science Abstract For decades, the software industry has amassed a vast repository of pre-compiled libraries and executables which are still valuable and actively in use. However, for a significant fraction of these binaries, most of the source code is absent or is written in old languages, making it practically impossible to recompile them for new generations of hardware. As the number of cores in chip multi-processors (CMPs) continue to scale, the performance of this legacy software becomes increasingly sub-optimal. Rewriting new optimised and parallel software would be a time-consuming and expensive task. Without source code, ex- isting automatic performance enhancing and parallelisation techniques are not applicable for legacy software or parts of new applications linked with legacy libraries. In this dissertation, three tools are presented to address the challenge of optimising legacy binaries. The first, GBR (Guided Binary Recompilation), is a tool that recompiles stripped application binaries without the need for the source code or relocation infor- mation. GBR performs static binary analysis to determine how recompilation should be undertaken, and produces a domain-specific hint program. This hint program is loaded and interpreted by the GBR dynamic runtime, which is built on top of the open-source dynamic binary translator, DynamoRIO. In this manner, complicated recompilation of the target binary is carried out to achieve optimised execution on a real system. The problem of limited dataflow and type information is addressed through cooperation be- tween the hint program and JIT optimisation. The utility of GBR is demonstrated by software prefetch and vectorisation optimisations to achieve performance improvements compared to their original native execution. The second tool is called BEEP (Binary Emulator for Estimating Parallelism), an extension to GBR for binary instrumentation. BEEP is used to identify potential thread- level parallelism through static binary analysis and binary instrumentation. BEEP per- forms preliminary static analysis on binaries and encodes all statically-undecided questions into a hint program. The hint program is interpreted by GBR so that on-demand binary instrumentation codes are inserted to answer the questions from runtime information. BEEP incorporates a few parallel cost models to evaluate identified parallelism under different parallelisation paradigms. The third tool is named GABP (Guided Automatic Binary Parallelisation), an ex- tension to GBR for parallelisation. GABP focuses on loops from sequential application binaries and automatically extracts thread-level parallelism from them on-the-fly, under the direction of the hint program, for efficient parallel execution. It employs a range of runtime schemes, such as thread-level speculation and synchronisation, to handle runtime data dependences. GABP achieves a geometric mean of speedup of 1.91x on binaries from SPEC CPU2006 on a real x86-64 eight-core system compared to native sequential execu- tion. Performance is obtained for SPEC CPU2006 executables compiled from a variety of source languages and by different compilers. Acknowledgements First and foremost, I would like to thank my supervisor Dr. Timothy Jones who guided me throughout my years in Cambridge. I appreciate the opportunities and challenges he offered me, and his valuable support that made this dissertation possible. I would have never got this far without his expertise and assistance. No matter how tough the technical challenges I encountered, he is always there for discussion and encourages me to find solutions and carry on. My sincere appreciation also goes out to: • Edmund Grimley-Evans, my advisor at ARM, for his invaluable insight in many technical challenges I faced. During my placement at ARM, thanks to his enor- mous support, my understanding in dynamic binary translation has significantly strengthened. • Robert Mullins, for providing constant source of encouragement and research ideas throughout my years in the Computer Lab from MPhil to PhD, especially during the weekend. • Niall Murphy, for his assistance in the implementation of parallel execution models. • Tom Sun, for his assistance in finding parallelism from SPEC CPU2006 and provid- ing many legacy binaries compiled by different compilers. • Sam Ainsworth, for his assistance in providing the software prefetching optimisation algorithms and test suites. • Dennis Zhang, for his assistance in resolving a few significant technical challenges. • Negar Miralaei, who tragically died shortly before finishing her dissertation, for her kind support and encouragement during the hard times. • Everyone in Computer Architecture Group, for discussions and exchange at scrum meetings every two days. • St John's College for funding my tuition fee during my study and research in Cam- bridge, for providing the best scenery, environment, food and accommodation in Cambridge. • ARM Ltd for funding my tuition fee and maintenance during my study and research in Cambridge. • My wife and parents, for always being been so loving, caring and supportive during my research. Contents 1 Introduction 13 1.1 Binary Recompilation . 13 1.2 Automatic Parallelisation . 14 1.3 Contributions . 15 1.4 Structure of this Dissertation . 17 2 Background 19 2.1 Dynamic Binary Translation . 19 2.1.1 Translation Process . 20 2.1.2 Just-In-Time Recompilation . 21 2.1.3 Linking and Indirect Branch Handling . 23 2.1.4 Trace Optimisation . 24 2.1.5 Dynamic Binary Instrumentation . 24 2.2 Static Binary Analysis . 26 2.2.1 Binary Abstraction . 27 2.2.2 Dependence Analysis . 29 2.3 Automatic Parallelisation . 34 2.3.1 Independent Multi-threading . 34 2.3.2 Cyclic Multi-threading . 35 2.3.3 Pipelined Multi-threading . 37 2.3.4 Polyhedral Multi-threading . 37 2.3.5 Speculative Multi-threading . 38 2.3.5.1 Thread Level Speculation . 39 2.3.5.2 Transactional Memory . 40 2.3.6 Profile Guided Multi-threading . 42 2.4 Summary . 43 3 Recompiling Binaries As Instructed 45 3.1 System Overview . 45 3.2 Design Choices . 47 3.2.1 Dynamic Binary Translation . 47 3.2.2 Static Binary Analysis . 48 3.2.3 Hint Instruction Interface . 48 3.3 Guided Binary Recompilation . 49 3.3.1 Guided IR Modification . 50 3.3.1.1 Case Study: JIT Prefetching Recompilation . 50 3.3.2 Guided Binary Instrumentation . 55 3.3.3 Partial Static Recompilation . 56 3.3.3.1 Case Study: Binary Vectorisation . 58 3.4 Related Work . 59 3.5 Summary . 60 4 Uncovering Parallelism In Binaries 63 4.1 Demand-driven Instrumentation . 63 4.1.1 Binary Emulator For Estimating Parallelism . 64 4.1.2 Benchmarks . 66 4.1.3 Loop Coverage Profiling . 66 4.1.4 Dynamic Data Dependence Profiling . 68 4.2 Ideal Parallel Execution Models . 70 4.2.1 DOACROSS Dataflow Model . 70 4.2.2 Induction/Reduction Optimisation . 72 4.2.3 Code Motion Model . 74 4.3 Realistic Parallel Execution Model . 78 4.3.1 Synchronisation Model . 79 4.3.1.1 Ambiguous Static Binary Analysis . 80 4.3.1.2 Thread Communication Latency . 83 4.3.2 Thread-level Speculation Model . 83 4.4 Related Work . 86 4.5 Summary . 87 5 Automatic Binary Parallelisation Framework 89 5.1 System Overview . 90 5.2 Static Hint Generation . 93 5.2.1 Loop Recognition . 93 5.2.2 Dependence and Alias Analysis . 94 5.2.3 Loop Characterisation and Selection . 98 5.2.4 Hint Generation . 99 5.3 Thread Management . 99 5.3.1 Threading States . 100 5.3.2 Thread Privatisation . 104 5.3.2.1 Thread Local Storage . 104 5.3.2.2 Register and Stack Privatisation . 105 5.3.2.3 Heap Privatisation . 106 5.4 DOALL Loop Parallelisation . 107 5.4.1 Block Parallelisation . 108 5.4.2 Cyclic Parallelisation . 108 5.5 Resolving Runtime Data Dependencies . 109 5.5.1 Just-In-Time Software Transactional Memory . 109 5.5.1.1 Read and Write Buffer . 110 5.5.1.2 Hint-Guided JIT Speculation . 110 5.5.1.3 Speculative Signal Handlers . 113 5.5.2 Speculative Value Prediction . 114 5.5.2.1 Guided Speculative Value Prediction . 115 5.5.2.2 Speculation or Synchronisation . 115 5.5.2.3 Versioned Signal and Wait . 116 5.6 Generic Loop Parallelisation . 117 5.6.1 Hint Generation Strategy . 118 5.6.2 Correctness and Verification . 119 5.6.2.1 Static Consistency Verification . 119 5.6.2.2 Dynamic Runtime Validation . 119 5.7 Related Work . 120 5.8 Summary . 121 6 System Evaluation 123 6.1 Experimental Setup . 123 6.2 Performance Evaluation . 125 6.2.1 Overhead Analysis . 126 6.2.2 Machine and System Variance . 128 6.2.3 Cyclic vs Block Parallelisation . 131 6.3 Irregular Loop Evaluation . 133 6.4 Performance Comparison and Related Work . 136 6.4.1 Comparison with Kotha on PolyBench . 136 6.4.2 Comparison with gcc autopar ..................... 137 6.5 Summary . 138 7 Conclusion and Future Work 139 7.1 Contribution . 139 7.1.1 Guided Binary Recompilation . 139 7.1.2 Binary Emulator for Estimating Parallelism . 140 7.1.3 Guided Automatic Binary Parallelisation . 140 7.2 Future Work . 141 7.2.1 Standardisation . ..