Extracting Parallelism from Legacy Sequential Code Using Transactional Memory

Mohamed M. Saad

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Engineering

Binoy Ravindran, Chair
Anil Kumar S. Vullikanti
Paul E. Plassmann
Robert P. Broadwater
Roberto Palmieri
Sedki Mohamed Riad

May 25, 2016 Blacksburg, Virginia

Keywords: Transactional Memory, Automatic Parallelization, Low-Level Virtual Machine, Optimistic Concurrency, Speculative Execution, Legacy Systems, Age Commitment Order, Low-Level TM Semantics, TM Friendly Semantics

Copyright 2016, Mohamed M. Saad

Extracting Parallelism from Legacy Sequential Code Using Transactional Memory

Mohamed M. Saad

(ABSTRACT)

Increasing the number of processors has become the mainstream approach for modern chip design. However, most applications are designed or written for single-core processors, so they do not benefit from the numerous underlying computation resources. Moreover, there exists a large base of legacy software that would require an immense effort and cost to rewrite and re-engineer for parallel execution. In the past decades, there has been a growing interest in automatic parallelization, both to relieve programmers from painful and error-prone manual parallelization, and to cope with the architectural trend of multi-core and many-core CPUs. Automatic parallelization techniques vary in properties such as: the level of parallelism (e.g., instructions, loops, traces, tasks); the need for custom hardware support; the use of optimistic execution versus conservative decisions; whether the analysis is online, offline, or both; and the level of source code exposure. Transactional Memory (TM) has emerged as a powerful concurrency control abstraction. TM simplifies parallel programming to the level of coarse-grained locking while achieving fine-grained locking performance. This dissertation exploits TM as an optimistic execution approach for transforming a sequential application into a parallel one. The design and the implementation of two frameworks that support automatic parallelization, Lerna and HydraVM, are proposed, along with a number of algorithmic optimizations that make the parallelization effective.

HydraVM is a virtual machine that automatically extracts parallelism from legacy sequential code (at the bytecode level) through a set of techniques including code profiling, data dependency analysis, and execution analysis. HydraVM is built by extending the Jikes RVM and modifying its baseline compiler. Correctness of the program is preserved by exploiting Software Transactional Memory (STM) to manage concurrent and out-of-order memory accesses. Our experiments show that HydraVM achieves a speedup of between 2x and 5x on a set of benchmark applications.

Lerna is a compiler framework that automatically and transparently detects and extracts parallelism from sequential code through a set of techniques including code profiling, instrumentation, and adaptive execution. Lerna is cross-platform and independent of the programming language. The parallel execution exploits memory transactions to manage concurrent and out-of-order memory accesses. This scheme makes Lerna very effective for sequential applications with data sharing. This thesis introduces the general conditions for embedding any transactional memory algorithm into Lerna. In addition, ordered versions of four state-of-the-art algorithms have been integrated and evaluated using multiple benchmarks including RSTM micro benchmarks, STAMP and PARSEC. Lerna achieves an average 2.7x (and up to 18x) speedup over the original (sequential) code.

While prior research shows that transactions must commit in order to preserve program semantics, enforcing this ordering constrains scalability at a large number of cores. In this dissertation, we eliminate the need to commit transactions sequentially without affecting program consistency. This is achieved by building a cooperation mechanism in which transactions can safely forward some of their changes. This approach eliminates some of the false conflicts and increases the concurrency level of the parallel application. This thesis proposes a set of commit-order algorithms that follow the aforementioned approach.
Interestingly, using the proposed commit-order algorithms, the peak gain over the sequential non-instrumented execution is 10x in RSTM micro benchmarks and 16.5x in STAMP. Another main contribution is to enhance the concurrency and the performance of TM in general, and its usage for parallelization in particular, by extending the TM primitives. The extended TM primitives extract the embedded low-level application semantics without affecting the TM abstraction. Furthermore, as the proposed extensions capture common code patterns, they can be handled automatically through the compilation process. In this work, that was done by modifying the GCC compiler to support our TM extensions. Results show speedups of up to 4x on different applications including micro benchmarks and STAMP. Our final contribution is supporting the commit order through Hardware Transactional Memory (HTM). The HTM contention manager cannot be modified because it is implemented inside the hardware. Given this constraint, we exploit HTM to reduce the transactional execution overhead by proposing two novel commit-order algorithms and a hybrid reduced-hardware algorithm. The use of HTM improves performance by up to 20%.

This work is supported by the VT-MENA program.

Dedication

To my wife, my parents, and my grandma.

Acknowledgments

I would like to thank my advisor, Dr. Binoy Ravindran, for all his help and support. Also, I would like to thank my friend, mentor and co-advisor, Dr. Roberto Palmieri, for his guidance, dedication and efforts. I would also like to thank my committee members: Dr. Robert P. Broadwater, Dr. Anil Kumar S. Vullikanti, Dr. Paul E. Plassmann and Dr. Sedki Mohamed Riad, for their guidance, feedback and advice. It is a great honor for me to have them serving on my committee. Many thanks to all my friends in the Systems Software Research Group for their support, assistance and cooperation. They include Dr. Ahmed Hassan, Dr. Mohamed Mohamedin, Dr. Sandeep Hans, Dr. Sebastiano Peluso, Dr. Alexandru Turcu, Joshua Bockenek, Marina Sadini, and many others. Also, special thanks to Dr. Mohamed E. Khalefa for the fruitful discussions and his valuable input that had considerable influence in shaping this work. Finally, I would like to thank my wife, Shaimaa, for her endless love, understanding and patience. Also, I would like to thank my parents for their encouragement and support.

Contents

1 Introduction
    1.1 Motivation
        1.1.1 Manual Parallelization
        1.1.2 Automatic Parallelization
    1.2 Contributions
        1.2.1 HydraVM
        1.2.2 Lerna
        1.2.3 Commitment Order Algorithms
        1.2.4 TM-Friendly Semantics
    1.3 Thesis Organization

2 Background
    2.1 Manual Parallelization
    2.2 Automatic Parallelization
    2.3 Thread-Level Speculation
    2.4 Transactional Memory
        2.4.1 NOrec
        2.4.2 TinySTM
    2.5 Parallelism Limits and Costs

3 Past & Related Work
    3.1 Transactional Memory
    3.2 Parallelization
    3.3 Optimistic Concurrency
        3.3.1 Thread-Level Speculation
        3.3.2 Parallelization using Transactional Memory
    3.4 Comparison with existing work

4 HydraVM
    4.1 Program Reconstruction
    4.2 Transactional Execution
    4.3 Jikes RVM
    4.4 System Architecture
        4.4.1 Bytecode Profiling
        4.4.2 Trace detection
        4.4.3 Parallel Traces
        4.4.4 Reconstruction Tuning
        4.4.5 Misprofiling
    4.5 Implementation
        4.5.1 Detecting Real Memory Dependencies
        4.5.2 Handling Irrevocable Code
        4.5.3 Method Inlining
        4.5.4 ByteSTM
        4.5.5 Parallelizing Nested Loops
    4.6 Experimental Evaluation
    4.7 Discussion

5 Lerna
    5.1 Challenges
    5.2 Low-Level Virtual Machine
    5.3 General Architecture and Workflow
    5.4 Code Profiling
    5.5 Program Reconstruction
        5.5.1 Dictionary Pass
        5.5.2 Builder Pass
        5.5.3 Transactifier Pass
    5.6 Transactional Execution
        5.6.1 High-priority Transactions
        5.6.2 Transactional Increment
    5.7 Algorithms
        5.7.1 Ordered NOrec
        5.7.2 Ordered TinySTM
        5.7.3 Other Algorithms
    5.8 Adaptive Runtime
        5.8.1 Batch Size
        5.8.2 Jobs Tiling and Partitioning
        5.8.3 Workers Selection
        5.8.4 Manual Tuning
    5.9 Evaluation
        5.9.1 Micro-benchmarks
        5.9.2 The STAMP Benchmark
        5.9.3 The PARSEC Benchmark
        5.9.4 The Effect of Changing TM Algorithm
    5.10 Discussion

6 Ordered Write Back Algorithm
    6.1 Commit Order
    6.2 Execution and Memory Model
        6.2.1 Age-based Commit Order (ACO)
    6.3 Analyzing the Ordering Overhead
        6.3.1 Blocking/Stall Approach
        6.3.2 Freeze/Hold Approach
    6.4 General Design
        6.4.1 Cooperative Ordered Transactional Execution
    6.5 The Ordered Write Back (OWB)
    6.6 Implementation
        6.6.1 Lock Structure
        6.6.2 Thread Execution
    6.7 Correctness
    6.8 Evaluation

7 Ordered Undolog Algorithm
    7.1 Ordered Undolog Algorithm (OUL)
        7.1.1 The OUL-Steal Algorithm
    7.2 Implementation
        7.2.1 Lock Structure
        7.2.2 Thread Execution
    7.3 Correctness
    7.4 Evaluation
        7.4.1 Micro Benchmark
        7.4.2 STAMP Benchmark
    7.5 Discussion

8 Extending TM Primitives using Low Level Semantics
    8.1 The Evolution of Semantic TM
    8.2 TM-Friendly API
        8.2.1 TM-friendly semantics in action
    8.3 Semantic-Based TM Algorithms
        8.3.1 Semantic NOrec Algorithm (S-NOrec)
        8.3.2 Semantic Transactional Locking 2 Algorithm (S-TL2)
    8.4 Correctness
        8.4.1 Correctness of S-NOrec
        8.4.2 Correctness of S-TL2
    8.5 Integration with GCC
    8.6 Evaluation
        8.6.1 RSTM-based implementations
        8.6.2 GCC-based implementations

9 Exploiting Hardware Transactional Memory
    9.1 Haswell's RTM
        9.1.1 The Challenges of ACO Support
    9.2 The Burning Tickets Hardware Algorithm (BTH)
        9.2.1 Configuring Tickets Burning
        9.2.2 Conflict Resolution
    9.3 The Timeline Flags Hardware Algorithm (TFH)
        9.3.1 Configuring Timeline Flags
        9.3.2 Evaluation
    9.4 The Ordered Write Back - Reduced Hardware Algorithm (OWB-RH)
        9.4.1 Hybrid TM Design Choices
        9.4.2 Algorithm Description
        9.4.3 Evaluation

10 Conclusions and Future Work
    10.1 Discussion & Limitations of Parallelization using TM
        10.1.1 Recommended Programming Patterns
    10.2 Future Work
        10.2.1 Transaction Checkpointing
        10.2.2 Complex Semantic Expressions
        10.2.3 Semantic-Based HTM
        10.2.4 Studying the Impact of Data Access Patterns

Bibliography

List of Figures

2.1 Loop classification according to inter-iteration dependencies
2.2 Running traces over three processors
2.3 OpenMP code snippet
2.4 Example of thread-level speculation over four processors
2.5 An example of transactional code using atomic TM constructs
2.6 Maximum Speedup according to Amdahl's Law and Gustafson's Laws

4.1 Program Reconstruction as Jobs
4.2 Parallel execution pitfalls: (a) Control Flow Graph, (b) Possible parallel execution scenario, and (c) Managed TM execution
4.3 Transaction States
4.4 HydraVM Architecture
4.5 Matrix Multiplication
4.6 Program Reconstruction as a Producer-Consumer Pattern
4.7 3x3 Matrix Multiplication Execution using Traces Generated by profiling 2x2 Matrix Multiplication
4.8 Static Single Assignment form Example
4.9 Nested Traces
4.10 HydraVM Speedup

5.1 Lerna's Loop Transformation: from Sequential to Parallel
5.2 LLVM Three Layers Design
5.3 Lerna's Architecture and Workflow
5.4 The LLVM Intermediate Representation using SSA form of Figure 5.1a
5.5 Natural, Simple and Transformed Loop
5.6 Symmetric vs Normal Transactions
5.7 Conditional Counters
5.8 Workers Manager
5.9 ReadNWrite1 Benchmark
5.10 ReadWriteN Benchmark
5.11 MCAS Benchmark
5.12 Adaptive workers selection
5.13 Kmeans and Genome Benchmarks
5.14 Vacation and SSCA2 Benchmarks
5.15 Labyrinth and Intruder Benchmarks
5.16 Effect of Tiling on abort and speedup using 8 workers and Genome
5.17 Kmeans performance with user intervention
5.18 PARSEC Benchmarks
5.19 Effect of changing the TM algorithm (y-axis in log-scale)

6.1 States of a transaction execution in OWB and OUL
6.2 The Execution of Ordered Transactions using Blocking/Stall Approach
6.3 The Execution of Ordered Transactions using Freeze/Hold Approach
6.4 The Execution of Ordered Transactions using our approach
6.5 An execution of OWB, which is TMS1. Initial value of all the shared variables is 0. The ACO is T1 ≺ T2 ≺ T3

7.1 OUL Transaction States
7.2 Peak performance of all competitors (including unordered) using all micro benchmarks (Y-axis is log scale)
7.3 Disjoint Benchmark
7.4 ReadNWrite1 Benchmark
7.5 ReadWriteN Benchmark
7.6 MCAS Benchmark
7.7 Aborts Breakdown
7.8 Execution time of STAMP Kmeans (Y-axis log scale)
7.9 Execution time of STAMP applications (Y-axis log scale)
7.10 OWB and OUL Algorithms Summary

8.1 Probing a hash table with open addressing
8.2 Micro Benchmarks using RSTM
8.3 STAMP Applications using RSTM
8.4 STAMP Applications using RSTM
8.5 Micro Benchmarks using GCC
8.6 Some STAMP Applications using GCC

9.1 Executions of Next-to-Commit Hardware Algorithm
9.2 Tickets Burning with number of tickets (N)=9, and delay (D)=3
9.3 Timeline Flags with N=19, and M=3
9.4 BTH vs. TFH using Bank and TPC-C Benchmarks
9.5 BTH and TFH fast-path execution
9.6 Execution of two transactions using HyTM
9.7 Aborts Breakdown
9.8 OWB-RH fast-path execution
9.9 OWB vs. OWB-RH using RSTM Micro-benchmark

10.1 Checkpointing implementation with write-buffer and undo-logs

List of Tables

1.1 Comparison with existing techniques

4.1 Testbed and platform parameters for HydraVM experiments
4.2 Profiler Analysis on Benchmarks

5.1 Transactional Memory Design Choices
5.2 Testbed and platform parameters for Lerna experiments
5.3 Input configurations for STAMP benchmarks
5.4 Input configurations for PARSEC benchmarks

6.1 Handling of Read/Write between concurrent transactions

8.1 Extended TM Constructs
8.2 Extended GCC ABI
8.3 Average Number of Operations per Transaction

Acronyms

ACO: Age-based Commit Order
Advisor XE: Parallel Advisor
AOS: Adaptive Optimization System
BiSort: Bitonic Sort
BTH: Burning Tickets Algorithm
CFG: Control Flow Graph
CM: Contention Management
CMP: Chip Multiprocessor
DASTM: Dependency Aware STM model
DBMS: Database Management Systems
ETL: Encounter-Time Locking
GC: Garbage Collector
GCC: GNU Compiler Collection
HJM: Heath-Jarrow-Morton
HTM: Hardware Transactional Memory
HyTM: Hybrid Transactional Memory
ILP: Instruction-Level Parallelization
IR: Intermediate Representation
Jikes RVM: Jikes Research Virtual Machine
JNI: Java Native Interface
JVM: Java Virtual Machine
LLVM: Low-Level Virtual Machine
LRU: Least Recently Used
LSA: Lazy Snapshot Algorithm
MC: Monte Carlo simulation
MMTK: The Memory Manager Toolkit
MPI: Message Passing Interface
MSSP: Master/Slave Speculative Parallelization
MST: Minimum Spanning Tree
NOrec: No Ownership Records Algorithm
NUMA: Non-Uniform Memory Access
OLTP: On-Line Transaction Processing
OpenMP: Open Multi-Processing
OUL: Ordered Undo Logging Algorithm
OUL-Steal: OUL Lock-Steal Algorithm
OWB: Ordered Write Back Algorithm
OWB-RH: OWB Reduced Hardware Algorithm
PLPP: Pattern Language for Parallel Programming
RAW: Read-After-Write
RTM: Restricted Transactional Memory
SMTX: Software Multi-threaded Transactions
S-NOrec: Semantic NOrec Algorithm
SSA: Static Single Assignment form
STAMP: Stanford Transactional Applications for Multi-Processing
S-TL2: Semantic Transactional Locking 2 Algorithm
STM: Software Transactional Memory
TCM: Transaction Commit Manager
TFH: Timeline Flags Algorithm
TL2: Transactional Locking 2 Algorithm
TLS: Thread-Level Speculation
TM: Transactional Memory
TMS1: Transactional Memory Specification 1
TPC-C: Transaction Processing Performance Council
TSP: Traveling Salesman Problem
WAR: Write-After-Read
WAW: Write-After-Write

Chapter 1

Introduction

In the last decade, parallelism has gained a lot of attention due to the physical constraints that prevent further increases in processor operating frequency. The runtime of a program is measured by the time required to execute its instructions. Decreasing the runtime requires reducing the execution time of single instructions, which implies increasing the operating frequency. From the mid 1980s until the mid 2000s^1, this approach, namely frequency scaling, was the dominant force behind improvements in commodity processor performance. However, operating frequency is proportional to power consumption and, consequently, heat generation. These obstacles put an end to the era of frequency scaling and forced chip designers to find an alternative approach for improving the performance of applications. Gordon E. Moore made the empirical observation that the number of transistors in a dense integrated circuit doubles approximately every two years. Given the power consumption issues, these additional transistors are instead used to add extra hardware units, which has made multicore processors the norm in chip design. Wulf et al. [187] report that the rate of improvement in processor speed exceeds the rate of improvement in memory speed, a gap known as the memory wall. This wall constrains the performance of any program running on a single processing unit. These combined factors (i.e., power consumption, the availability of extra hardware resources, and the memory wall) made parallelism, which has long been employed in high-performance computing, appear as an appealing alternative.

Parallelism is the execution of a sequential application simultaneously on multiple computation resources. The application is divided into multiple sub-tasks that can run in parallel. The communication between sub-tasks defines the parallelism granularity. An application is coarse-grained when communications between its sub-tasks are rare, while an application with a lot of sub-task communications exhibits fine-grained parallelism.

^1 In 2004, because of the increase in processor power consumption, Intel announced the cancellation of its Tejas and Jayhawk processors, the planned successors of the Pentium 4 and Xeon processor families, respectively.

The maximum possible speedup of a single program as a result of parallelization is given by Amdahl's law. This law defines the relation between the achievable speedup and the time needed for the sequential fraction of the program. On the other hand, the vast majority of applications and algorithms are designed or written for single-core processors (often intentionally designed to be sequential to reduce development costs, while exploiting Moore's law on single-core chips).
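For reference, Amdahl's law in its standard form (stated here for convenience; Figure 2.6 in Chapter 2 plots it), where p is the fraction of the program that can be parallelized and n is the number of processors:

    S(n) = \frac{1}{(1 - p) + p/n}

As n grows, the speedup is bounded by 1/(1 - p), so even a small sequential fraction caps the achievable gain; e.g., with p = 0.9, no number of processors can yield more than a 10x speedup.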

1.1 Motivation

Many organizations with enterprise-class legacy software are increasingly faced with a hardware technology refresh challenge due to the ubiquity of Chip Multiprocessor (CMP) hardware. This problem is apparent when legacy codebases run into several million LOC and are not concurrent. Manual exposition of concurrency is largely non-scalable for such codebases due to the significant difficulty in exposing concurrency and ensuring the correctness of the converted code. In some instances, sources are not available due to proprietary reasons, intellectual property issues (of integrated third-party software), and organizational boundaries. Additionally, adapting these programs to exploit hardware parallelism requires a (possibly) massive amount of software rewriting by skilled programmers with knowledge and experience of parallel programming. They must take care of both high-level aspects, such as data sharing, race conditions, and parallel section detection, and low-level hardware features, such as thread communication, locality, caching, and scheduling. This motivates techniques and tools for automated concurrency refactoring, or automatic parallelization, as well as the design of new programming languages that aid writing parallel programs, or manual parallelization.

1.1.1 Manual Parallelization

Designing a parallel program requires both identifying and implementing parallelism. This has gained significant research interest for helping programmers identify and develop parallel programs. Identifying parallelism requires a deep understanding of the sequential program to convert and of the problem domain. In [118], the Pattern Language for Parallel Programming (PLPP) was proposed to formalize high-quality solutions to frequently occurring problems in the parallelization domain. PLPP helps programmers to find concurrency; to build a parallel structure such as divide and conquer, pipeline, decomposition, and event-based coordination; and to support these structures using techniques like master/worker, fork/join, shared queues, distributed arrays, and loop parallelism. Intel Parallel Advisor [1] (or Advisor XE) automates part of the manual parallelization effort by providing a shared-memory threading design and prototyping tool. It helps programmers detect parallel sections and compare performance using different threading models. Additionally, it finds data dependencies and enables eliminating them, if possible.

On the other hand, different languages have been designed to facilitate parallel programming development. For example, OpenMP [51] is a compiler extension for the C, C++ and Fortran languages that supports adding parallelism to existing code, mainly loops, without significant changes. Cilk [27] is a C-based runtime for multithreaded parallel programming using the fork/join model. Cilk represents the program as a directed graph of non-blocking code snippets. Each node in the graph executes its code within a separate thread. A node may spawn children nodes, and its code runs concurrently with its children's code (non-blocking execution). However, in order to receive its children's return values, the parent node needs to spawn a successor node. A different approach was introduced by the Atomos [40] language, where parallel sections are organized as transactions. The Atomos programming model supports strong atomicity by providing watch and retry constructs to handle conflicting parallel sections. Despite all the aforementioned techniques, tools and programming languages, the procedure of manual parallelization is still costly, error-prone, time consuming, and inapplicable in situations where the source code is not available.
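To make the fork/join model just described concrete, below is a minimal sketch in Cilk-style C. The keyword spelling (cilk_spawn, cilk_sync) follows the later Cilk Plus dialect rather than the original Cilk syntax of [27], so it should be read as illustrative:

    #include <cilk/cilk.h>

    /* Recursive Fibonacci: each spawned call may run as a parallel task. */
    long fib(int n) {
        if (n < 2)
            return n;
        long x = cilk_spawn fib(n - 1);  /* fork: child may run in parallel */
        long y = fib(n - 2);             /* parent continues concurrently   */
        cilk_sync;                       /* join: wait for spawned children */
        return x + y;
    }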

1.1.2 Automatic Parallelization

Automatic parallelization simplifies the life of programmers, especially those not extensively exposed to the nightmare of developing efficient concurrent applications. It also allows a growing number of (even legacy) systems to benefit from the presently available cheap hardware parallelism. Automatic parallelization targets programming transparency: the application is coded as sequential, and the specific methodology used for handling concurrency is hidden from the programmer. Automatic parallelization has been extensively studied during the past decades [68, 79, 146, 181, 176, 43, 45, 65, 142, 111, 186, 120]. Past efforts on automatic parallelization can be broadly classified into speculative (optimistic execution) and non-speculative (pessimistic execution) techniques.

Non-speculative techniques usually come in the form of compiler support, where static analysis of the code is employed. For example, alias analysis [47] detects if multiple references point to the same memory location. Memory dependence analysis [17, 184] analyzes a memory operation and extracts the preceding memory operations that depend on it. Points-to analysis [175] identifies, with some assumptions on the address type, which memory locations are referred to by pointers. Polyhedral analysis [29, 74] uses an abstract mathematical representation to analyze memory access patterns. Unlike dependency analysis, where each node in the dependency graph represents one statement in the source program, in the polyhedral model each point corresponds to one statement instance. This permits building a distance vector between instances, which captures the carried dependencies. That way, several loop transformations can be applied to enhance the parallelism, such as tiling, index set splitting, loop fusion, and nested loop regeneration.
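As a small illustration of why these analyses matter (an assumed example, not taken from the dissertation), the loop below may run its iterations in parallel only if the compiler can prove that dst and src never overlap, which is precisely the question alias and points-to analyses try to answer:

    /* With the C99 restrict qualifiers, the compiler may assume dst and
     * src do not alias, and can parallelize or vectorize the loop;
     * without them, a conservative compiler must keep it sequential. */
    void scale(double *restrict dst, const double *restrict src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = 2.0 * src[i];
    }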

Using these analyses, and others, helps non-speculative techniques identify sections of the code with no memory dependencies. These independent sections of the code are eligible to run in parallel while preserving a consistent view of memory. However, these techniques do not translate well to low-level memory operations (e.g., pointer arithmetic, unions, and function pointers) in the C and C++ programming languages.

On the other hand, speculative techniques exploit optimistic execution with compensating actions to recover from invalid operations (e.g., memory conflicts). Among speculative techniques, two appealing primary approaches exist: thread-level speculation (TLS) and transactional memory (TM). TLS runs code speculatively, usually through hardware, and eventually the correct order is determined. The evaluated operations are detected to be correct or not; incorrect results are discarded and the thread is restarted. TLS uses the processor cache as a buffer for thread memory changes. Cache coherence protocols are overloaded with the TLS monitoring algorithm. Squashing a thread is done by aborting the thread and discarding the cache contents, while committing thread changes is done by flushing the cache to main memory. Parallelization using thread-level speculation (TLS) has been extensively studied using both hardware [177, 102, 80, 44] and software [146, 111, 43]. It was originally proposed by Rauchwerger et al. [146] for identifying and parallelizing loops with independent data access (primarily arrays). The common characteristics of TLS implementations are: they largely focus on loops as a unit of parallelization; they mostly rely on hardware support or changes to the cache coherence protocols; and the parallel sections are usually small (e.g., innermost loops). An example of a TLS automated parallelizing compiler is POSH [111]. POSH focuses on loops and subroutines as units of parallelization, and employs a profiling phase to discard ineffective sections (non hot-spot portions of the code). Using a simulated TLS CMP with 4 superscalar cores, POSH achieves an average speedup of 1.3x on SPECint2000 applications.

The second speculative approach is Transactional Memory (TM). Transactions were originally proposed by Database Management Systems (DBMS) to guarantee atomicity, consistency, data integrity, and durability of operations manipulating shared data. This synchronization primitive has been recently ported from DBMS to concurrent applications (by relaxing durability), providing a concrete alternative to the manual implementation of synchronization using basic primitives such as locks. This new multi-threading programming model has been named Transactional Memory [93]. TM has emerged as a powerful concurrency control abstraction [83, 107] that permits developers to define critical sections as simple atomic blocks, which are internally managed by the TM itself. It allows the programmer to access shared objects without providing the mutual exclusion by hand, which inherently overcomes the drawbacks of locks. As a result, with TM the programmer still writes concurrent code using threads, but the code is now organized so that reads and writes to shared memory objects are encapsulated into atomic sections. Each atomic section is a transaction, in which the enclosed reads and writes appear to take effect instantaneously. Transactions execute speculatively, while logging changes made to objects, e.g., using an undo-log or a write-buffer.
When two transactions conflict (e.g., read/write or write/write accesses to the same location), one of them is aborted and the other is committed, yielding (the illusion of) atomicity. Aborted transactions are restarted after rolling back their changes, e.g., undoing object changes using the undo-log (eager versioning), or discarding the write-buffer (lazy versioning). Besides a simple programming model, TM provides performance comparable to highly concurrent, fine-grained locking implementations [59, 33], and allows composability [84], which enables multiple nested atomic blocks to be executed as a single all-or-nothing transaction. Multiprocessor TM has been proposed in hardware (HTM), in software (STM), and in hardware/software combinations. TM's adoption has grown in the last years, especially after its integration with the popular GCC compiler (from release 4.7), and due to its integration into the cache-coherence protocol of commodity processors, such as Intel [99] and IBM [35], which natively allows transactions to execute directly on the hardware. Given its easy-to-use abstraction, TM would seem the missing building block enabling the (transparent) parallelization of sequential code by automatically injecting atomic blocks around parallel sections. STMlite [120] presents a lightweight STM implementation customized for automatic loop parallelization. Loop iterations are divided into chunks and distributed over the available cores. An outer loop is generated around the original loop body to manage parallel execution of the different chunks, and a centralized Transaction Commit Manager (TCM) controls the transactions' commits in the program's chronological order. Unfortunately, such a high-level abstraction comes with a price in terms of performance, which easily leads to parallel code slower than its sequential version (e.g., in [38, 166] a performance degradation of between 3 and 7 times has been shown).

Most of the methodologies, tools and languages for parallelizing programs target scientific and data-parallel, computation-intensive applications, where actual data sharing is very limited and the dataset is precisely analyzed by the compiler and partitioned so that parallel computation is possible. Examples of those approaches include [134, 153, 111, 94]. Alternatively, automatic parallelization using TLS requires special hardware support, while TLS has not made it into mainstream multicore systems yet. Finally, despite the flexibility and ease of use of TM, the primary drawback of using it for parallelization is the significant overhead of code execution and validation [38, 166]. Previous work on using TM for parallelization [73, 178] relied primarily on the programmer for the specification of parallel sections and the definition of ordering semantics. The key weaknesses with programmer reliance are:

1. It requires a full understanding of the software (e.g., the algorithm implementation and input characteristics) and mandates the existence of the source code;

2. It does not take into account TM characteristics and factors of overhead; and

3. It uses TM as a black box utility for preserving data integrity.

In contrast, the solutions proposed in this thesis target common programs and analyze the code in its intermediate representation, without the need for the original source code and without using custom hardware support. This makes us independent of the language in which the code was written and does not force exposure of the source code to the proposed framework, especially when it is not available. Additionally, the parallel version of the code produced by the proposed framework runs on commodity machines. This dissertation is fundamentally different from past STM-based parallelization works in that it benefits from static analysis for reducing transactional overheads and automatically identifies parallel sections (i.e., traces or loops) by compile- and runtime program analysis techniques, which are then executed as transactions. Additionally, this work targets arbitrary programs (not just recursive ones as [32] does), is entirely software-based (unlike [32, 66, 56, 182]), and does not require the program source code. Furthermore, this dissertation proposes a cooperative transactional execution model where atomicity is relaxed while preserving the original sequential semantics. Finally, it extends TM constructs to make use of low-level application semantics to increase the parallelism. Table 1.1 summarizes some of the main properties of this dissertation and compares them against existing techniques.

Table 1.1: Comparison with existing techniques (X = supported, - = not supported).

Property                                  Manual     Non-Spec.  TLS [177,     TM         This
                                          [118, 1]   [74]       102, 80, 44]  [120, 73]  Dissertation
No custom hardware (commodity machines)   X          X          -             X          X
Tolerates non-trivial dependencies        X          -          X             X          X
White-box TM model (cooperative model,
  ordering algorithms)                    -          -          -             -          X
Automatic parallelization (no user
  intervention)                           -          X          X             X          X
Employs static analysis                   -          X          -             -          X
Language independence / intermediate
  representation                          -          -          X             -          X
Employs application semantics             X          -          -             -          X

Unlike other techniques, TLS usually uses hardware to run the code speculatively [177, 102, 80, 44]. This makes TLS independent of the underlying source code and programming language. This dissertation achieves a similar level of programming-language independence by relying on intermediate representations of the source code (e.g., JVM bytecode, LLVM bytecode, GCC Gimple). Non-speculative solutions [74] rely on strong guarantees of data or control dependencies provided by static analysis [47, 17, 175, 29]. These guarantees limit parallelization where non-trivial dependencies exist (e.g., pointer operations, global state accesses). This dissertation employs static analysis for a different purpose, i.e., reducing TM overhead. On the other hand, speculative execution techniques using TM [120, 73] use TM as a black box; however, TM is designed for concurrency control rather than for parallelization. The use of TM for parallelization requires some adaptation of the classical TM model (i.e., a white-box TM model), which is done as a part of this work. Finally, manual parallelization techniques [118, 1] open the door for the programmer to rewrite (refactor) the code while preserving application semantics. This dissertation shares the same sweet spot of utilizing application semantics for parallelization, through a novel semantic-based TM extension.

1.2 Contributions

This thesis tries to bridge the gap between the parallelism of existing multicore architectures and the sequential design of most (including existing) applications. This is done by presenting a set of techniques for extracting parallelism from sequential programs for speculative parallel thread execution on multiprocessor architectures, and "thread transactification" for managing concurrent and out-of-order memory accesses. The techniques include code profiling, alias analysis, data dependency analysis, execution analysis at the intermediate representation level, cooperative transactional execution, and exploiting low-level semantics of programs. The dissertation makes the following three main contributions.

Automatic Parallelization Frameworks

This dissertation introduces the design and the implementation of two automatic parallelization frameworks: HydraVM and Lerna. Both HydraVM and Lerna require no programmer intervention, are completely automated, and do not need the source code to be exposed, unlike previous approaches to parallelization that require the programmer to identify parallel sections or to provide additional information to help the parallelization process. The common technique used in the two proposed implementations is the use of Transactional Memory (TM) as an optimistic execution approach for transforming a sequential application into a parallel one "blindly", meaning without external interventions. Using the proposed frameworks, the user benefits from TM seamlessly by running sequential programs and enjoys a safe parallel execution. Under the hood, the frameworks handle the problems of detecting parallel code snippets, concurrent memory accesses, synchronization, and resource utilization.

Unlike Lerna, HydraVM exploits dynamic techniques for profiling, parallel code detection, and adaptive execution of the program. Besides, HydraVM targets the trace as a unit of parallelism, while Lerna focuses on loop parallelization. From the experience with HydraVM, several lessons were learned about parallel patterns and bottlenecks in programs. First, relying only on dynamic analysis introduces an overhead that can be easily avoided through a pre-execution static analysis phase. Second, adaptive design should not be limited to the architecture (e.g., runtime recompilation), but could be extended to: selecting the best number of worker threads, the assignment of parallel code to threads, and the depth of speculative execution. Last, except for recursive calls, most of the parallel traces detected by HydraVM were primarily loops. Keeping these goals in sight, Lerna was designed as a compiler framework. With Lerna, the sequential code is transformed into parallel code with best-effort TM support to guarantee safety and to preserve ordering. Lerna produces a native executable that is still adaptive, thanks to the Lerna runtime library that orchestrates and monitors the parallel execution. The programmer can interact with Lerna to aid the static analysis for the sake of producing a lower-overhead program.

HydraVM and Lerna are the first set of compiler/runtime infrastructures that advance the state-of-the-art [120] by: employing static analysis to reduce TM overhead and to allow accessing some local memory variables on the stack; providing a vehicle to run different TM algorithms for parallelization (including STMLite [120]); and presenting an adaptive model that uses the feedback collected from the execution to optimize the transactional execution and to recover from misprofiling situations.

Commit-Order TM Algorithms

Parallelization using TM requires concurrency control algorithms that ensure a specific (predefined) commit order of the transactions injected into the runtime framework. Transaction ordering intuitively means considering not just the set of transactions as input of the problem, but also the specific commit order that must be enforced for them. Such a formulation inherently brings up a fundamental trade-off between the level of parallelism achievable, given the need to commit in order, and the performance of a single-threaded execution without any software instrumentation (which is needed to prevent conflicts when running in parallel). TM algorithms have been designed to solve the classical concurrency problem where a set of transactions is invoked in parallel and a history of their operations is built to let them commit.

In this thesis, a set of commit-order TM algorithms is proposed: the Ordered Undo Logging Algorithm (OUL), the Ordered Write Back Algorithm (OWB), the OUL Lock-Steal Algorithm (OUL-Steal), and the OWB Reduced Hardware Algorithm (OWB-RH), all designed with an eye on the commit order as a fundamental system requirement. It is shown that even in the presence of data conflicts, the proposed algorithms are able to significantly outperform single-threaded execution, as well as other baseline and specialized state-of-the-art competitors. OUL, OWB, OUL-Steal and OWB-RH are the first set of TM algorithms that exploit cooperation between transactions for the sake of reducing the in-order commit waiting time. OWB, and its RH variant, are the first algorithms that support Transactional Memory Specification 1 (TMS1) [62], a weaker consistency condition than opacity [76, 77], the most popular consistency condition for TM. TMS1 has been proved to be sufficient to guarantee safety [13] in the parallelization model, as is the case with opacity.
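To illustrate the in-order commit waiting time these algorithms attack, consider a minimal sketch of the baseline blocking scheme (hypothetical names: tx_commit stands in for a TM commit routine; the algorithms of Chapters 6 and 7 are designed to avoid exactly this stall):

    #include <stdatomic.h>

    extern void tx_commit(void);       /* hypothetical TM commit call      */

    atomic_long next_to_commit;        /* age of the next transaction
                                          allowed to commit                */

    void ordered_commit(long my_age) {
        while (atomic_load(&next_to_commit) != my_age)
            ;                          /* stall: not this thread's turn    */
        tx_commit();
        atomic_fetch_add(&next_to_commit, 1);  /* hand the token onward    */
    }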

Low Level TM Semantics

Analyzing the results produced by executing the aforementioned commit-order algorithms inside the proposed automatic parallelization frameworks, it is clear that there is still a gap in performance that is difficult to fill without having a deeper knowledge of the semantics of the original (sequential) application. This thesis proposes a solution that increases the level of semantics that can be captured automatically, therefore improving performance further. The proposed extensions also the capabilities of the classical transactional memory programming model. In fact, histori- cally, the TM model defines two language/library constructs for reading and writing memory addresses. As a common pattern, each transaction maintains its own read-set and write-set to detect conflicts with other concurrent transactions. Two concurrent transactions are said to be conflicting if they access the same address and at least one access is a write. However, in some situations conflicting transactions can safely commit and still preserve the appli- cation semantics, which means that the conflict triggered by the TM framework is a “false conflict” at the semantic level. As a part of this thesis, the classical TM primitives were extended by identifying TM-friendly semantics and proposing an approach to inject them in the current TM algorithms and frameworks. Furthermore, the proposed extensions were integrated in GCC to provide a full compiler support. The proposed TM-friendly semantic primitives are the first such extensions that allow both the compiler-support and capturing application low-level semantics. Furthermore, this extension can be deployed as a TM library without affecting TM generality and with a minimum learning curve for the programmers because all the extensions map to known programming language primitives.

1.2.1 HydraVM

This is a Java virtual machine based on the Jikes RVM. The user runs a sequential program (Java classes) using the virtual machine, and internally the bytecode is instrumented at runtime to profile the execution paths. The profiling detects which parts of the code are suitable for parallelization and do not have data dependencies. Next, the bytecode is reconstructed by extracting portions of the code to run as separate threads. Each thread runs as a transaction, which means it operates on its private copy of memory. Conflicting transactions are aborted and retried, while a successful transaction commits its changes at the chronological time of the code it executes.

HydraVM [160] inherits the adaptive design of the Jikes RVM. The profiling and code reconstruction continue while the program is running. HydraVM monitors the performance and abort rate of the transformed code and uses this information to repeatedly reconstruct new versions of the code. For example, assume a loop is parallelized and each iteration runs in a separate transaction. If every three consecutive iterations of the loop conflict with each other, then a better reconstruction is to combine them so that each transaction runs three iterations instead of one. It is worth noting that the reconstruction process occurs at runtime by reloading the class definitions.

Finally, ByteSTM was developed as a software transactional memory implementation at the bytecode level. ByteSTM allows access to low-level memory (e.g., registers, thread stack), so it is possible to create a memory signature for the memory accessed by the current transaction. Comparing the signatures of concurrent transactions allows us to quickly detect conflicting transactions.
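The iteration-combining adaptation described above can be sketched as follows (illustrative C-style code; tx_begin and tx_commit are hypothetical placeholders, and HydraVM performs the equivalent transformation on Java bytecode at runtime):

    extern void tx_begin(void);    /* hypothetical: start a transaction   */
    extern void tx_commit(void);   /* hypothetical: ordered commit        */
    extern void body(int i);       /* the original loop body              */

    /* Initial reconstruction: one transaction per iteration. */
    void one_iteration_per_tx(int n) {
        for (int i = 0; i < n; i++) {
            tx_begin();
            body(i);
            tx_commit();           /* commits are forced into loop order  */
        }
    }

    /* If consecutive iterations keep conflicting, combine three of them
     * per transaction, cutting aborts and per-transaction overhead. */
    void three_iterations_per_tx(int n) {
        for (int i = 0; i < n; i += 3) {
            tx_begin();
            for (int j = i; j < i + 3 && j < n; j++)
                body(j);
            tx_commit();
        }
    }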

1.2.2 Lerna

Lerna [163] is a compiler that operates at the intermediate representation level, which makes it independent of the source code and the programming language used. The generated code is a task-based multi-threaded version of the input code. Unlike HydraVM, Lerna employs static analysis techniques, such as alias analysis and memory dependency analysis, to reduce the overhead of transactional execution, and here the focus is on parallelizing loops. Lerna is not limited to a specific TM implementation, and the integration of any TM algorithm can be done through a well-defined API. Additionally, in Lerna the mapping between extracted tasks and transactions is not one-to-one: a transaction can run multiple tasks (tiling), or a task can consist of multiple transactions (partitioning). Lerna and HydraVM share the idea of exploiting transactional memory, profiling, and producing adaptive code. However, as Lerna produces a native executable as output, it is not possible to rely on an underlying layer (the virtual machine in HydraVM) to support the adaptive execution. Instead, the program is linked with the Lerna runtime library, which is able to monitor and modify the key performance parameters of the executor module, such as: the number of worker threads, the mapping between transactions and tasks, and whether to execute in transactional mode or go sequentially. Both frameworks achieve an average speedup of 2.5x on a set of benchmarks including applications from JOlden [34], RSTM [2], STAMP [37] and PARSEC [133].

1.2.3 Commitment Order Algorithms

With the experience gained from using HydraVM and Lerna, it is learnt that preserving program order hampers scalability. A thread that completes its execution must either stall, waiting for its correct chronological turn in the program, or proceed with executing more transactions, which consequently increases the lifetime of the pending transactions and makes them subject to conflicts with other transactions. Additionally, transactional reads and writes are sandboxed, which prevents any possible cooperation between transactions that could still produce correct program results.

For example, in DOACROSS loops [49], iterations are data- or control-dependent; however, they can run in parallel by exchanging some data between them. This dissertation proposes a novel technique [164] for cooperation between concurrent transactions. With this technique, transactions can expose their changes to other transactions without waiting for their chronological commit times. Nevertheless, transactions with an earlier chronological order can abort completed transactions (and cascade the abort to any other affected transactions) whenever a conflict exists. Using this technique, the peak gain over the sequential non-instrumented execution is 10x in RSTM micro benchmarks and 16.5x in STAMP.

Furthermore, a set of commit-order algorithms that exploit Intel's Haswell [149] is presented. Haswell is the first mainstream CPU with transactional memory support. The use of HTM support reduces (or eliminates) the transactional execution overhead, which in effect magnifies the parallelization gain.

1.2.4 TM-Friendly Semantics

As mentioned before, the key intuition at the core of HydraVM and Lerna is the use of TM to execute the automatically generated parallel code. However, despite TM's high programmability and generality, its performance is still not as good as (or better than) optimized manual implementations of synchronization. Before the advent of TM, when thread synchronization was manually done using fine-grained locks and/or lock-free designs, providing high performance in multi-threaded applications depended upon the specific application semantics. For example, identifying the critical sections and the best number of locks to use are design choices that can be made only after deeply knowing the semantics of the application itself (i.e., what the application does). A related question that arises in this regard is: is there room for including semantics in TM frameworks without sacrificing their generality? If the answer is "yes", which is what this dissertation claims and assesses, then we will finally be able to overcome one of the main obstacles that has existed alongside TM since its early stages, and boost its performance accordingly. Motivated by the above question, this dissertation first identifies a set of semantics that can be included in TM frameworks without impacting the generality of the TM abstraction (TM-friendly semantics [162, 161]), and extends the existing TM APIs to include such semantics. Second, it is shown how to modify STM algorithms to exploit such semantic-based APIs. Results showed speedups of up to 4x over the original algorithms on different applications including micro benchmarks and STAMP. Finally, embedding those extensions in compiler passes (using GCC) is illustrated, so that the application development experience is not altered; additionally, such compiler integration permits us to exploit the new TM APIs within automatic parallelization frameworks.

The benefit of the proposed TM extension for automatic parallelization is multi-fold. First, it reduces aborts by utilizing the embedded application semantics and eliminating some false conflicts. Second, it reduces the number of required TM calls, which lowers the transaction overhead. Last, it implicitly handles common code patterns such as increments, which automatically eliminates inter-dependencies between transactions (e.g., loop iterations).
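For instance, the increment pattern could be captured by a single semantic construct, sketched below with a hypothetical name (tx_increment; the actual constructs are listed in Table 8.1):

    /* Hypothetical semantic TM call. */
    extern void tx_increment(long *addr, long delta);

    /* The classical pattern, tmp = tx_read(&sum); tx_write(&sum, tmp + x),
     * conflicts with any concurrent update of sum. A semantic construct
     * declares the intent in one call: */
    void add_sample(long *sum, long x) {
        tx_increment(sum, x);      /* increments commute, so concurrent
                                      updates of sum need not conflict,
                                      and only one TM call is issued     */
    }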

1.3 Thesis Organization

This thesis is organized as follows. Chapter 2 surveys the techniques for solving the parallelization problem. Past and related efforts are overviewed in Chapter 3. Chapters 4 and 5 detail the architecture, transformation and optimization techniques, and the experimental evaluation of the two implementations: HydraVM and Lerna. Chapters 6 and 7 describe the design and the implementation of the commit-order algorithms: the Ordered Write Back (OWB) algorithm and the Ordered Undolog (OUL) algorithm. Next, Chapter 8 presents a TM extension for exploiting low-level application semantics. In Chapter 9, Hardware Transactional Memory (HTM) is exploited for designing two hardware commit-order algorithms, the Burning Tickets Algorithm (BTH) and the Timeline Flags Algorithm (TFH), and for creating a hybrid version of the OWB algorithm, namely OWB-RH. Finally, Chapter 10 concludes the dissertation and proposes future work.

Chapter 2

Background

Parallel computations can be done at multiple levels such as instructions, branch targets, loops, execution traces, and subroutines. Loops have gained a lot of interest as they are by nature a major source of parallelism and usually contain most of the processing.

Instruction-Level Parallelization (ILP) is a measure of how many instructions can be run in parallel. ILP is application-specific, as it involves reordering the execution of instructions to run in parallel and to utilize the underlying hardware. ILP can be implemented in software (compiler) or in hardware (pipelining). Common techniques for supporting ILP are: out-of-order execution, where instructions execute in any order that does not violate data dependencies; register renaming, which renames instruction operands to avoid unnecessary reuse of registers; and speculative execution, where an instruction is executed before its place in the serial control flow is reached.

Since on average 20% of instructions are branches, branch prediction has been extensively studied to optimize branch execution and to run branch targets in parallel with the evaluation of the branching conditions. Branch predictors are usually implemented in hardware, with different variants. Early implementations of SPARC and MIPS used static prediction, always predicting that a conditional jump would not be taken and executing the next instruction in parallel with the evaluation of the condition. Dynamic prediction is implemented through a state machine that maintains a history of branches (per branch, or globally). For example, the Intel Pentium processor uses a four-state machine to predict branches. The state machine can be local (per branch instruction), as in the Intel Pentium MMX, Pentium II, and Pentium III; or global (a shared history of all conditional jumps), as in AMD processors and Intel's Pentium M, Core, and Core 2.

Loops can be classified as sequential loops, parallel loops (DOALL), and loops of intermediate parallelism (DOACROSS) [49] (see Figure 2.1). In DOALL loops, iterations are independent and can run in parallel, as no data dependencies exist between iterations. DOACROSS loops exhibit inter-iteration data dependencies.


(a) Sequential Loop:

    For i = 1 to n {
        Sum = Sum + A[i];
    }

(b) DOALL Loop:

    For i = 1 to n {
        D[i] = A[i] * B[i] + c;
    }

(c) Lexically-forward DOACROSS Loop:

    For i = 1 to n-1 {
        D[i] = A[i] + B[i];
        C[i] = D[i+1] * B[i];
    }

(d) Lexically-backward DOACROSS Loop:

    For i = 2 to n {
        D[i] = A[i] + B[i];
        C[i] = D[i-1] * B[i];
    }

Figure 2.1: Loop classification according to inter-iteration dependencies

When data from lower-index iterations is used by iterations with higher indices (lexically-backward), the processors executing the higher-index iterations must wait for the required calculations from the lower-index iterations to be evaluated (see Figure 2.1d). In contrast, with lexically-forward loops, lower-index iterations access data used at higher indices. Such loops can run in parallel without delay if both sets of iterations load their data at their starts.

Traces are defined as hot paths in program execution. Traces can be parts of loops or span multiple iterations, and can be parts of individual methods or span multiple methods. A trace is detected dynamically at runtime and is defined as a set of basic blocks (a basic block being a set of instructions ended by a branch or other terminator). Similarly, traces can run in parallel on multiple threads and are then linked together according to their entry and exit points. Figure 2.2 shows an example of traces that run in parallel on three processors.

[Figure 2.2: Running traces over three processors. (a) Traces, (b) Parallel Traces.]

In contrast to data parallelism (e.g., a loop operating on the same data as in Figure 2.1), task parallelism is a form of parallelization wherein tasks or subroutines are distributed over multiple processors. Each task has a different control flow and processes a different set of data; however, the tasks can communicate with each other by passing data between threads. Running different tasks in parallel reduces the overall runtime of a program.

Another classification of parallel techniques is according to user intervention. In manual parallelization, the programmer uses programming language constructs to define parallel portions of the code and protects shared data through synchronization primitives. An alternative approach is for the programmer to design and develop the program to run sequentially and use interactive tools that analyze the program (statically or dynamically) to provide hints about data dependencies and portions eligible for parallelization. Iteratively, the programmer modifies the code and compares the performance scaling of different threading designs to get the best possible speedup. Clear examples of this semi-automated technique are Intel Parallel Advisor [1] and the Paralax compiler [183]. Lastly, automatic parallelization aims to parallelize programs without any user intervention.


2.1 Manual Parallelization

Designing a parallel program that runs efficiently in a multi-processor environment is not trivial, as it involves multiple factors and requires a skilled programmer. Maintenance and debugging of parallel programs is notoriously difficult, as it involves race conditions, invalid shared data accesses, livelock and deadlock situations, and non-deterministic scenarios.

Firstly, the programmer must understand the problem being solved and the nature of the input data. A problem can be partitioned according to its domain (i.e., input data decomposition) or through functional decomposition (cooperative parallel sub-tasks). After partitioning the problem, there is usually some form of communication and shared data access between the different tasks. Both communication and shared accesses present a challenge (and usually delay) to the parallel program. An application is embarrassingly parallel when communication between its sub-tasks is rare and little data sharing is required. The following factors need to be considered for inter-task communication:

1. Frequency. An application exhibits fine-grained parallelism if its sub-tasks communicate frequently, while in coarse-grained parallelism, applications experience less communication. As communication takes place through communication media (e.g., buses), the contention over the communication media directly affects overall performance, especially if the media is being used for other purposes (e.g., transferring data from and to processors).

#pragma omp for private(n) shared(total) ordered schedule(dynamic)
for (int n = 0; n < 100; ++n) {
    files[n].compress();

    total += get_size(files[n]);

    #pragma omp ordered
    send(files[n]);
}

Figure 2.3: OpenMP code snippet

2. Cost. The communication cost is determined by the delay in sending and receiving the information and by the amount of transferred data.

3. Blocking. Communication can be either synchronous or asynchronous. Synchronous communication blocks at the receiver (and at the sender as well when an acknowledgment is required). On the other hand, asynchronous communication allows both sides to proceed with processing, but introduces more complexity to the application design.

Protecting shared data from concurrent access requires a synchronization mechanism such as barriers or locks. Synchronization primitives usually introduce bottlenecks and significant overhead to the processing time. Moreover, a misuse of locks can cause the application to deadlock (i.e., two or more tasks each waiting for the other to finish), livelock (i.e., tasks are not blocked, but are too busy responding to each other to resume work), or starvation (i.e., a greedy set of tasks keeps holding the locks for a long time).

Resolving data and control dependencies between tasks is the responsibility of the programmer. A good understanding of the inputs and the underlying algorithm helps in determining independent tasks; there are also tools that can help in this step [1]. Finally, the programmer should maintain a load balance across the computation resources. For example, static assignment of tasks to processors (e.g., round-robin) is a lightweight technique but may lead to low utilization, while dynamic assignment of tasks requires monitoring the state of tasks (i.e., start and end times) and handling a shared queue of ready-to-execute tasks.

As an example of parallel programming languages, OpenMP is a compiler extension for the C, C++, and Fortran languages that supports adding parallelism to existing code without significant changes. OpenMP primarily focuses on parallelizing loops and running iterations in parallel with optional ordering capabilities. Figure 2.3 shows an example of OpenMP parallel code that compresses a hundred files, but sends them in order, and calculates the total size after compression. Programmers can define shared variables (e.g., the non-local variable total) and thread-local variables (e.g., the loop counter n). Shared variables are transparently protected from concurrent access. A private copy is created for variables that are defined as thread-local (i.e., using the private primitive).

2.2 Automatic Parallelization

Past efforts on parallelizing sequential programs can be broadly classified into speculative and non-speculative techniques. Non-speculative techniques, which are usually compiler-based, exploit loop-level parallelism and differ in the types of data dependencies that they handle (e.g., static arrays, dynamically allocated arrays, pointers) [26, 79, 158, 55]. In speculative techniques, parallel sections of code run speculatively, guarded by a compensating mechanism for handling operations that violate application consistency. The key idea is to provide more concurrency where extra computing resources are available. Speculative techniques can be broadly classified based on

1. what program constructs they use to extract threads (e.g., loops, subroutines, traces, branch targets),

2. whether they are implemented in hardware or software,

3. whether they require source code, and

4. whether they are done online, offline, or both.

Of course, this classification is not mutually exclusive. The execution of speculative code is either eager or predictive. In eager speculative execution, every path of the code is executed (under the assumption of unlimited resources); however, only the correct value is committed. With predictive execution, selected paths are executed according to prediction heuristics. If there is a misprediction, the execution is rolled back and re-executed. Among speculative techniques, two appealing primary approaches exist: Thread-Level Speculation (TLS) and Transactional Memory (TM).

2.3 Thread-Level Speculation

Thread-Level Speculation (TLS) refers to the execution of out-of-order (unsafe) operations and caching of the results in thread-local storage or buffers (usually via the processor cache).

Figure 2.4: Example of thread-level speculation over four processors

TLS assigns an age to each thread according to the earliness of the code it executes. The thread with the earliest age is marked as safe. Speculative threads are subject to buffer overflow: when a thread's buffer is full, the thread either stalls (till it becomes the one with the lowest age) or is squashed and restarted. The exception is the safe thread (the one executing the earliest code). TLS monitors dependency violations (e.g., by checking cache line requests). For example, a Write-After-Read (WAR) dependency violation occurs when a higher age speculative thread modifies a value before a lower age thread needs to read it. Similarly, when a higher age speculative thread reads an address and then a lower age thread changes the value at this address, this is another dependency violation, named Read-After-Write (RAW). A different type of violation is caused by control dependencies, where a speculative thread executes unreachable code due to a change in the program control flow. Any dependency violation (data or control) causes the higher-age thread to be squashed and restarted. Figure 2.4 shows an example of TLS using four processors.

TLS is usually implemented in hardware. However, the same concept can be applied using software [105, 158, 55] (with performance issues), especially when low-level operations are not feasible [46, 42] (e.g., with Java-based frameworks). TLS software implementations use a data access signature (or summaries) to detect conflicts between speculative threads. An access signature represents all addresses accessed in the current thread; conflicts are detected by intersecting signatures and checking for partial matches.
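To make the signature mechanism concrete, below is a minimal sketch (with illustrative names, not taken from any of the cited systems) of how a software TLS runtime might detect conflicts by intersecting access signatures; on a match, the runtime would squash and restart the thread with the higher age:

import java.util.BitSet;

// Toy access signature for software TLS: every address a thread touches is
// hashed into a fixed-size bit set.
class AccessSignature {
    private static final int BITS = 1024;
    private final BitSet bits = new BitSet(BITS);

    void record(long address) {
        bits.set((int) Math.floorMod(address, (long) BITS)); // hash address into signature
    }

    // Conservative conflict test: any shared bit is a potential conflict.
    // Hash collisions can produce false positives but never false negatives,
    // so a thread may be squashed unnecessarily, but no true dependency
    // violation is ever missed.
    boolean mayConflictWith(AccessSignature other) {
        return bits.intersects(other.bits);
    }
}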

2.4 Transactional Memory

Lock-based synchronization is inherently error-prone. Coarse-grained locking, in which a large data structure is protected using a single lock, is simple and easy to use but permits little concurrency. In contrast, with fine-grained locking [123, 95], in which each component of a data structure (e.g., a bucket of a hash table) is protected by a lock, programmers must acquire the necessary and sufficient locks to obtain maximum concurrency without compromising safety. Both these situations are highly prone to programmer errors. In addition, lock-based code is non-composable. For example, atomically moving an element from one hash table to another using those tables' (lock-based) atomic methods is difficult: if the methods internally use locks, a thread cannot simultaneously acquire and hold the locks of the two tables' methods; if the methods were to export their locks, that would compromise safety. Furthermore, lock-inherent problems such as deadlock, livelock, lock convoying, and priority inversion have not gone away. For these reasons, lock-based concurrent code is difficult to reason about, program, and maintain [89].

Transactional memory (TM) [83] is a promising alternative to lock-based concurrency control. It was proposed as an alternative model for accessing shared memory addresses, without exposing locks in the programming interface, to avoid the drawbacks of locks. With TM, programmers write concurrent code using threads but organize code that reads/writes shared memory addresses as atomic sections. Atomic sections are defined as transactions in which reads and writes to shared addresses appear to take effect instantaneously. A transaction maintains a read-set and a write-set and, at commit time, checks for conflicts on shared addresses. Two transactions conflict if they access the same address and one access is a write. When that happens, a contention manager [168] resolves the conflict by aborting one of the transactions and allowing the other to proceed to commit, yielding (the illusion of) atomicity. Aborted transactions are restarted, often immediately. Thus, a transaction ends by either committing (i.e., its operations take effect) or by aborting (i.e., its operations have no effect). In addition to a simple programming model, TM provides performance comparable to highly concurrent, fine-grained locking implementations [59, 33] and is composable [84]. Multiprocessor TM has been proposed in hardware (HTM), in software (STM), and in hardware/software combinations (HyTM).

Transactional Memory algorithms differ according to their design choices. An algorithm can apply its changes directly to memory and maintain an undo-log for restoration to a consistent state upon failure; such an optimistic approach fits situations where conflicts are rare. In contrast, an algorithm can use a private thread-local buffer to keep its changes invisible to other transactions; at commit, the local changes are merged into main memory. Another orthogonal design choice is the time of conflict detection: transactions may acquire access (locks) to their write-sets at encounter time or during the commit phase.

Figure 2.5 shows some example transactional code. Atomic sections are executed as transactions. Thus, the possible values of A and B are either 42 and 42, or 22 and 11, respectively.
An inconsistent view of memory (e.g., A = 20 and B = 10), due to an atomicity violation or interleaved execution, causes one of the transactions to abort, roll back, and then re-execute.

Motivated by TM's advantages, several recent efforts have exploited TM for automatic parallelization. In particular, trace-based automatic/semi-automatic parallelization is explored in [31, 32, 39, 58], which use HTM to handle dependencies. [144] parallelizes loops with

A = 10, B = 20;

THREAD A:
atomic {
    B = B + 1;
    A = B * 2;
}

THREAD B:
atomic {
    B = A;
}

Figure 2.5: An example of transactional code using atomic TM constructs

dependencies using thread pipelines, wherein multiple parallel thread pipelines run concurrently. [120] parallelizes loops by running them as transactions, with STM preserving the program order. [173] parallelizes loops by running a non-speculative "lead" thread, while other threads run other iterations speculatively, with STM managing dependencies.

2.4.1 NOrec

The No Ownership Records Algorithm (NOrec) [53] is a lazy software transactional memory algorithm that uses a minimal amount of metadata for accessed memory addresses. Unlike other STM algorithms, NOrec does not associate ownership records (orecs) with accessed addresses. Instead, it uses value-based validation at commit time to make sure the addresses it read still hold the values the transaction observed during its execution. Transactions use a write-buffer to store their updates; the implementation of the write-buffer is a linear-probed hash table with versioned buckets to support O(1) clearing (when transaction descriptors are reused). NOrec uses a lazy locking mechanism, which means written addresses are not locked until commit time. This approach reduces locking time for the accessed memory locations, which allows readers to proceed without writer interference.

NOrec employs a single global sequence lock, which a committing transaction acquires by incrementing it; this design limits the system to a single committing writer at a time. The commit procedure starts by atomically incrementing the sequence lock. Failing to increment the lock means another writer transaction is trying to commit, so the current transaction needs to validate its read-set and wait for the other transaction to finish its commit. Upon a successful increment, the transaction exposes its changes to memory and then releases the sequence lock.

Another drawback of the algorithm is that, before each read, it may have to validate the whole read-set. This is required to maintain opacity [75] – a TM property which mandates that even invalid transactions must always read a consistent view of memory. NOrec avoids unneeded validation by performing it only when the global sequence number has changed (i.e., when a transaction commit has taken place).

NOrec is the default TM algorithm for Lerna. In Chapter 5, we describe our variant of the algorithm that preserves ordering between concurrent transactions.
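The following is a simplified sketch of this design (the Memory and AbortException helpers are stand-ins for raw memory access and abort handling; the real algorithm [53] additionally covers memory fences, O(1) write-set clearing, and contention management):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Models shared memory as a map from addresses to values (illustrative only).
class Memory {
    static final ConcurrentHashMap<Long, Long> cells = new ConcurrentHashMap<>();
    static long load(long addr) { return cells.getOrDefault(addr, 0L); }
    static void store(long addr, long value) { cells.put(addr, value); }
}

class AbortException extends RuntimeException {}

class NorecTx {
    static final AtomicLong globalSeq = new AtomicLong(0);   // odd = a commit is in progress
    private long snapshot;                                   // sequence value observed at begin
    private final Map<Long, Long> readSet = new LinkedHashMap<>();
    private final Map<Long, Long> writeSet = new LinkedHashMap<>();

    void begin() {
        do { snapshot = globalSeq.get(); } while ((snapshot & 1L) != 0);
    }

    long read(long addr) {
        if (writeSet.containsKey(addr)) return writeSet.get(addr); // read-after-write
        long value = Memory.load(addr);
        while (globalSeq.get() != snapshot) {  // a commit happened: revalidate
            snapshot = validate();
            value = Memory.load(addr);
        }
        readSet.put(addr, value);
        return value;
    }

    void write(long addr, long value) { writeSet.put(addr, value); }

    // Value-based validation: re-read every address in the read-set and
    // compare against the values observed earlier.
    private long validate() {
        while (true) {
            long s = globalSeq.get();
            if ((s & 1L) != 0) continue;        // a committer is active; spin
            for (Map.Entry<Long, Long> e : readSet.entrySet())
                if (Memory.load(e.getKey()) != e.getValue()) throw new AbortException();
            if (globalSeq.get() == s) return s; // read-set still consistent
        }
    }

    void commit() {
        if (writeSet.isEmpty()) return;         // read-only transactions commit for free
        while (!globalSeq.compareAndSet(snapshot, snapshot + 1))
            snapshot = validate();              // lost the race: revalidate and retry
        for (Map.Entry<Long, Long> e : writeSet.entrySet())
            Memory.store(e.getKey(), e.getValue());
        globalSeq.set(snapshot + 2);            // sequence becomes even again: release
    }
}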

2.4.2 TinySTM

TinySTM [69] is a lightweight, lock-based STM algorithm. It uses a single-version, word-based variant of the LSA [150] algorithm. Similar to other word-based locking algorithms [53, 59], TinySTM relies upon a shared array of locks that covers all memory addresses. Multiple addresses are covered by the same lock; each lock uses a single bit for its state (i.e., locked or not) and the remaining bits as a version number. This version number indicates the timestamp of the last transaction that wrote to any of the addresses covered by this lock. As a lock covers a portion of the address space, false conflicts can occur when concurrent transactions access adjacent addresses.

Unlike NOrec, TinySTM uses Encounter-Time Locking (ETL): transactions acquire locks during execution. The benefit of ETL is twofold: conflicts are discovered early, which avoids wasting processing time executing doomed transactions; and the handling of read-after-write situations is simplified. The algorithm uses a time-based design [59, 150] by employing a shared counter as a clock. Update transactions acquire a new timestamp on commit, validate their read-sets, and then store the timestamp into the versioned locks of their write-sets.

TinySTM is proposed with two strategies for memory access, write-through and write-back, each with its advantages and limitations. Using the write-back strategy, updates are kept in a local transaction write-buffer until commit time, while in write-through, updates go directly to main memory and the old values are stored in an undo log. With write-through, transactions have lower commit-time overhead and faster read-after-write/write-after-write handling; however, aborts are costly as they require restoring the old values of the written addresses. With write-back, on the other hand, the abort procedure simply discards the read and write sets, but commit requires validating the read-set and moving the write-set values from the local write-buffer to main memory.

The use of ETL is interesting to our work as it enables early detection of conflicting transactions, which saves processing cycles. Additionally, the write-through strategy is well suited to low-contention workloads, as it involves lightweight commits that can lead to performance comparable to sequential execution.
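To illustrate the versioned-lock design, here is a minimal sketch of a time-based transactional read in the write-back variant (illustrative names; it reuses the Memory and AbortException helpers from the NOrec sketch above):

import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of a TinySTM-style versioned-lock table. Each slot covers many
// addresses: bit 0 is the lock bit, the remaining bits hold the version
// (the commit timestamp of the last writer of any covered address).
class VersionedLocks {
    static final int SIZE = 1 << 20;                       // power of two
    static final AtomicLongArray locks = new AtomicLongArray(SIZE);

    static int slot(long addr) { return (int) ((addr >>> 3) & (SIZE - 1)); }

    // A read is valid only if the covering lock is free, its version is not
    // newer than the reader's start timestamp, and the lock word did not
    // change around the data load; otherwise the transaction aborts (or
    // revalidates and extends its timestamp, which is omitted here).
    static long read(long addr, long startTime) {
        long before = locks.get(slot(addr));
        if ((before & 1L) != 0 || (before >>> 1) > startTime)
            throw new AbortException();                    // locked or too new
        long value = Memory.load(addr);
        long after = locks.get(slot(addr));
        if (before != after) throw new AbortException();   // raced with a writer
        return value;
    }
}

Because several addresses hash to the same slot, two transactions touching adjacent addresses can appear to conflict; this is exactly the false-conflict effect described above.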

(a) Amdahl's Law (b) Gustafson's Law

Figure 2.6: Maximum speedup according to Amdahl's and Gustafson's Laws

2.5 Parallelism Limits and Costs

Amdahl's law is a formula that defines the upper bound on the expected speedup of an overall system when only part of the system is improved. Let S be the fraction of the program that is serial (i.e., not included in the parallelism). When executing the program using n threads, the expected execution time is

Time(n) = Time(1) × (S + (1 − S)/n)

Therefore, the maximum speedup is given by the following formula

Speedup(n) = 1/(S + (1 − S)/n) = n/(1 + S × (n − 1))
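As a quick sanity check of the numbers discussed below, substituting a 10% serial fraction (S = 0.1) and n = 20 threads gives

Speedup(20) = 20/(1 + 0.1 × (20 − 1)) = 20/2.9 ≈ 6.9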

Figure 2.6a shows the maximum possible speedup for different percentages of the sequential portion of the code and different numbers of threads. With only 10% of the code being sequential, the maximum speedup using 20 threads is only around 7×. Thus, at some point, adding an additional processor to the system adds less speedup than the previous one, as the total speedup heads toward the limit of 1/S. However, this assumes a fixed input size. Usually, adding more processors permits solving larger problems (i.e., increasing the input size), consequently increasing the parallel portion of the program. This fact is captured by Gustafson's Law, which states that computations involving arbitrarily large data sets can be efficiently parallelized. Accordingly, the speedup can be defined by the following formula (see Figure 2.6b)

Speedup(n) = n − S × (n − 1)
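For comparison, the same values (S = 0.1, n = 20) under Gustafson's Law give a far more optimistic bound, since the parallel portion is assumed to grow with the problem size:

Speedup(20) = 20 − 0.1 × (20 − 1) = 18.1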

In general, parallel applications are much more complex than sequential ones. The complexity appears in every aspect of the development cycle, including design, development, debugging, tuning, and maintenance, in addition to the hardware requirements for running a parallel system. The return on each added processor is not guaranteed to be reflected in the overall performance. On the contrary, sometimes splitting the workload over even more threads increases rather than decreases the amount of time required to finish; this is known as parallel slowdown.

Chapter 3

Past & Related Work

3.1 Transactional Memory

The classical solution for handling shared memory during concurrent access is to protect shared addresses with locks [11, 98]. However, locks have many drawbacks, including deadlock, livelock, lock-convoying, priority inversion, non-composability, and the overhead of lock management. TM, proposed by Herlihy and Moss [93], is an alternative approach for shared memory access with a simpler programming model. Memory transactions are similar to database transactions: a memory transaction is a self-maintained entity that guarantees atomicity (all or none), isolation (local changes are hidden till commit), and consistency (linearizable execution). TM has gained significant research interest, including Software Transactional Memory (STM) [170, 127, 84, 82, 91, 92, 116], Hardware Transactional Memory (HTM) [93, 81, 10, 28, 128], and Hybrid Transactional Memory (HyTM) [19, 54, 129, 104]. STM has relatively higher overhead, due to transaction management and its architecture independence. HTM has the lowest overhead but assumes architecture specializations. HyTM seeks to combine the best of HTM and STM.

STM can be broadly classified as static or dynamic. In static STM [170], all accessed addresses are defined in advance, while dynamic STM [91, 92, 116] relaxes that restriction. The dominant trend in STM designs is to implement the single-writer/multiple-reader pattern, either using locks [60, 59] or obstruction-free techniques [150, 91] (i.e., where a single thread executed in isolation will complete its operation within a bounded number of steps), though a few implementations allow multiple writers to proceed under certain conditions [151]. In fact, it is shown in [67] that obstruction-freedom is not an important property and results in less efficient STM implementations than lock-based ones.

Another orthogonal TM property is address acquisition time: pessimistic approaches acquire

addresses at encounter time [67, 33], while optimistic approaches do so at commit time [60, 59]. Optimistic address acquisition generally provides better concurrency with an acceptable number of conflicts [59]. STM implementations also rely on write-buffer [114, 116, 91] or undo-log [129] approaches to ensure a consistent view of memory. In the write-buffer approach, address modifications are written to a local buffer and take effect at commit time. In the undo-log method, writes directly change the memory and the old values are kept in a separate log to be retrieved on abort.

Nesting (or composability) is an important feature for transactions, as it allows partial rollback and introduces semantics between the parent transaction and its enclosed ones. Earlier TM implementations did not support nesting or simply flattened nested transactions into a single top-level transaction. Harris et al. [84] argue that closed nested transactions, supporting partial rollback, are important to implementing composable transactions, and presented an orElse construct that relies upon closed nesting. In [6], Adl-Tabatabai et al. presented an STM that provides both nested atomic regions and orElse, and introduced the notion of mementos to support efficient partial rollback. Recently, a number of researchers have proposed the use of open nesting. Moss described the use of open nesting to implement highly concurrent data structures in a transactional setting [130]. In contrast to a database setting, the different levels of nesting are not well-defined; thus, different levels may conflict. For example, a parent and child transaction may both access the same memory location and conflict.

3.2 Parallelization

Parallelization has been widely explored using different levels of programmer intervention. Manual parallelization techniques rely on the programmer for the analysis and design phases but assist in the implementation and performance tuning. Open Multi-Processing (OpenMP) [51] defines an API that supports shared memory programming in C, C++, and Fortran. The programmer uses OpenMP directives to define parallel sections of the code and the execution semantics (e.g., ordering). Synchronization of shared access is handled through directives that define the scope of access (i.e., shared or private). OpenMP transparently and efficiently handles low-level operations such as locking, scheduling, and thread stalling. Message Passing Interface (MPI) [171] is a message-based, language-independent communications protocol used for programming parallel computers. MPI offers basic concepts for messaging between parallel processes such as: grouping and partitioning of processes, various types of communication (i.e., point-to-point, broadcast, or reduce), interchanged/custom data types, and synchronization (global, pairwise, and remote locks). NVIDIA introduced CUDA [138] as a parallel programming and computing platform for its graphics processing units (GPUs); this enables programmers to access a GPU's virtual instruction set and memory and use it for general purpose processing (not exclusively graphics computation). GPUs rely on many slow concurrent threads rather than a limited number of fast cores (as in CPUs), which makes GPUs suitable for data-parallel computations.

An alternative approach, semi-automatic parallelization, provides the programmer with hints and design decisions that help in writing code that is more amenable to parallelization, or detects data dependencies and shared accesses during the design phase, when they are less expensive to fix. Paralax [183] is a compiler framework for extracting parallel code using static analysis. Programmers use annotations to assist the compiler in finding parallel sections. DiscoPoP [110] and Kremlin [71] are runtime tools that discover and present ranked suggestions for parallelization opportunities. Intel Parallel Advisor (Advisor XE) [1] is a shared-memory threading design and prototyping tool. Advisor XE helps programmers detect parallel sections and compare performance using different threading models. Additionally, it finds data dependencies and enables eliminating them, if possible.

Automatic parallelization aims to produce best-effort performance without any programmer intervention; this relieves programmers from designing and writing complex parallel applications and, besides that, is highly useful for legacy code. Polaris [68], one of the earliest parallelizing compiler implementations, proposes a source-to-source transformation that generates a parallel version of the input program (Fortran). Another example is the Stanford SUIF compiler [79]; it determines parallelizable loops using a set of data dependency techniques such as data dependence analysis, scalar privatization analysis, and reduction recognition. SUIF optimizes cache usage by ensuring that processors reuse the same data and by defragmenting shared addresses. Parallelization techniques can be classified according to their granularity.
Significant efforts have focused on enhancing fine-grained parallelism, such as that of nested loops [30, 146, 49], regular/irregular array accesses [159], and scientific computations [192] (e.g., dense/sparse matrix calculations). Sohi et al. [172] increase instruction-level parallelism by dividing tasks for speculative execution amongst functional units; [154] does so with a trace-based processor. Nikolov et al. [137] use a hybrid processor/SoC architecture to exploit nested loop parallelism, while [9] uses postdominance information to partition tasks on a multithreaded architecture. However, fine-grained parallelism is not sufficient for exploiting CMP parallelism. Coarse-grained parallelism focuses on parallelizing tasks as units of work. In [179], the programmer manually annotates concurrent and synchronized code blocks in C programs, and those annotations are then used for runtime parallelization. Gupta et al. [78] and Rugina et al. [158] perform compile-time analysis to exploit parallelism in array-based, divide-and-conquer programs.

3.3 Optimistic Concurrency

Optimistic concurrency techniques, such as Thread-Level Speculation (TLS) and Transactional Memory (TM), have been proposed as a way of extracting parallelism from legacy code. Both techniques split an application into sections, using hardware or a compiler, and run them speculatively on concurrent threads. A thread may buffer its state or expose it and use a compensating procedure. At some point, the executed code may become safe, and the code then proceeds as if it were executed sequentially. Otherwise, the code's changes are reverted and its execution is restarted. Some efforts combine TLS and TM through a unified model [18, 145, 144] to get the best of the two techniques.

3.3.1 Thread-Level Speculation

Automatic parallelization for thread-level speculation (TLS) hardware has been extensively studied, with the focus largely on loops [146, 181, 65, 111, 176, 43, 45, 142, 186]. Loop parallelization using TLS has been proposed in both hardware [141] and software [146, 22]. The LRPD Test [146] determines if a loop has any cross-iteration dependencies (i.e., is a DOALL loop) and runs it speculatively; at runtime, a validation step is performed to check if the accessed data is affected by any unpredictable control flow. Saltz and Mirchandaney [167] parallelize DOACROSS loops by assigning iterations to processors in a wrapped manner. To prevent data dependency violations, processors have to stall till the correct values are produced. Polychronopoulos [140] proposes running the maximal set of continuous iterations with no dependencies concurrently; consequently, this method does not fully utilize all processors. An alternative approach is to remove dependencies between the iterations. Krothapalli et al. [103] proposed a runtime method to remove anti- (Write-After-Read, WAR) and output (Write-After-Write, WAW) dependencies. This method helps to remove dependencies caused by reusing memory/registers, but does not remove computation dependencies.

Prabhu et al. [141] presented some guidelines for programmers to manually parallelize code using TLS. In their work, a source-to-source transformation is performed in order to run the loops speculatively, while TLS hardware detects any data dependency violations and acts accordingly (i.e., squashing the threads and restarting the iteration). Zilles et al. [191] introduced an execution paradigm called Master/Slave Speculative Parallelization (MSSP). In MSSP, a master processor executes an approximate version of the program, based on common-case execution, to compute selected values that the full program's execution is expected to compute. The master's results are checked by slave processors that execute the original program.

Automatic and semi-automatic parallelization without TLS hardware has also been explored [105, 158, 55, 46, 42]. In [144], Raman et al. proposed a software approach that generalizes existing software TLS memory systems to support speculative pipelining schemes and efficiently tunes them for loop parallelization. Jade [105] is a programming language that supports coarse-grained concurrency and exploits a software TLS technique. Using Jade, programmers augment the code with data dependency information, and the compiler uses this to determine which operations can be executed concurrently. Deutsch [55] analyzes symbolic access paths for interprocedural may-alias analysis with the goal of exploiting parallelism. In [46], Choi et al. present escape analysis for Java programs for determining object lifetimes for concurrency enhancement.

3.3.2 Parallelization using Transactional Memory

Tobias et al. [66] proposed an epoch-based speculative execution of parallel traces using hardware transactional memory (HTM). Parallel sections are identified at runtime based on the binary code. The conservative nature of the design does not utilize all cores; besides, relying on runtime-only techniques for parallelization introduces non-negligible overhead to the framework. Similarly, DeVuyst et al. [56] used HTM to optimistically run parallel sections, which are detected using special hardware. Sambamba [178] showed that static optimization at compile-time does not exploit all possible parallelism. Similar to our work, it used Software Transactional Memory (STM) with an adaptive runtime approach for executing parallel sections; it relies on user input for defining parallel sections. Gonzalez et al. [73] proposed a user API for defining parallel sections and ordering semantics; based on the user input, STM is used to handle concurrent sections. In contrast, HydraVM does not require special hardware and is fully automated, with optional user interaction for improving speedup.

The study in [184] classified applications into sequential, optimistically parallel, or truly parallel, and classified tasks into ordered (speculative iterations of a loop) and unordered (critical sections). It introduced a model that captures data and inter-dependencies. For a set of benchmarks [37, 15, 135, 41], the study reported important features of each, such as the size of the read and write sets, the dependency density, and the size of parallel sections. Mehrara et al. [120] present STMlite: a lightweight STM implementation customized to facilitate profile-guided automatic loop parallelization. In this scheme, loop iterations are divided into chunks and distributed over the available cores. An outer loop is generated around the original loop body to manage parallel execution between the different chunks. In multi-threaded transactions (MTX), a transaction runs under more than one thread. Vachharajani et al. [182] presented a hardware memory system that supports MTXs. Their system changes the hardware cache coherence protocol to buffer speculative states and recover from mis-speculation. Software Multi-threaded Transactions (SMTX) are used to handle memory accesses for speculated threads. Both SMTX and STMlite use a centralized transaction commit manager and conflict detection that is decoupled from the main execution.

3.4 Comparison with existing work

Most of the methodologies, tools, and languages for parallelizing programs target scientific and data-parallel, computation-intensive applications, where actual data sharing is very limited and the dataset is precisely analyzed by the compiler and partitioned so that parallel computation is possible. Examples of those approaches include [134, 153, 111, 94].

Despite the flexibility and ease of use of TM, the primary drawback of using it for parallelization is the significant overhead of code execution and validation [38, 166]. Previous work [73, 178] relied primarily on the programmer for the specification of parallel sections and the definition of ordering semantics. The key weaknesses of programmer reliance are:

1. it requires a full understanding of the software (e.g., the algorithm implementation and input characteristics) and mandates the existence of the source code;

2. it does not take into account TM characteristics and factors of overhead; and

3. it uses TM as a black box utility for preserving data integrity.

In contrast, the solutions proposed in this thesis target common programs and analyze the code in its intermediate representation, without the need for the original source code. This makes us independent of the language in which the code was written and does not force exposure of the source code to our framework, which matters especially when the source is not available. Additionally, in Lerna, we employ alias analysis and propose two novel techniques, high-priority transactions and transactional increment, for reducing the read-set size and validation overhead and for eliminating some causes of transactional conflicts. To the best of our knowledge, this is the first study that attempts to reduce transactional overhead based on program characteristics.

HydraVM and MSSP [191] share the concept of using execution traces (not just loops, as in [120, 173]). However, MSSP uses superblock [96] executions for validating the main execution. In contrast, HydraVM splits execution equally over all threads and uses STM for handling concurrent memory accesses.

Perhaps the closest to our proposed work are [32] and [66]. Our work differs from [32] and [66] in the following ways. First, unlike [66], we propose STM for concurrency control, which does not need any hardware transactional support. Second, [32] is restricted to recursive programs, whereas we allow arbitrary programs. Third, [32] does not automatically infer transactions; rather, the entire work performed in tasks (of traces) is packaged as transactions. In contrast, we propose compile- and runtime program analysis techniques that identify traces, which are executed as transactions. Our work is fundamentally different from past STM-based parallelization works in that we benefit from static analysis for reducing transactional overheads and automatically identify parallel sections (i.e., traces or loops) by compile- and runtime program analysis techniques, which are then executed as transactions. Additionally, our work targets arbitrary programs (not just recursive ones, as [32] does), is entirely software-based (unlike [32, 66, 56, 182]), and does not require program source code. Thus, our proposed work on STM-based parallelization (with the consequent advantages of STM's concurrency control, being completely software-based, and needing no program sources) has never been done before.

Chapter 4

HydraVM

In this chapter, we present a virtual machine, called HydraVM, that automatically extracts parallelism from legacy sequential code (at the bytecode level) through a set of techniques including online and offline code profiling, data dependency analysis, and execution analysis. HydraVM is built by extending the Jikes RVM [12] and modifying its baseline compiler, and it exploits software transactional memory to manage concurrent and out-of-order memory accesses.

HydraVM targets extracting parallel jobs in the form of code traces. This approach is different from loop parallelization [30], because a trace is equivalent to an execution path, which can be a portion of a loop or span loops and method calls. Traces were introduced in [16] as part of HP's Dynamo optimizer, which optimizes a native program binary at runtime using a trace cache. To handle potential memory conflicts, we develop ByteSTM, a VM-level STM implementation. In order to preserve the original semantics, ByteSTM suspends completed transactions till their valid commit times are reached. Aborted transactions discard their changes and are either terminated (i.e., on a program flow violation or a misprediction) or re-executed (i.e., to resolve a data-dependency conflict).

We experimentally evaluated HydraVM on a set of benchmark applications, including a subset of the JOlden benchmark suite [34]. Our results reveal speedups of up to 5× over the sequential code.


4.1 Program Reconstruction

A program can be represented as a set of basic blocks, where each basic block is a sequence of non-branching instructions that ends either with a branch instruction (conditional or non-conditional) or a return. Thus, any program can be represented by a directed graph in which nodes represent basic blocks and edges represent the program control flow – i.e., the Control Flow Graph (CFG). Basic blocks can be determined at compile-time. However, our main goal is to determine the context and frequency of reachability of the basic blocks – i.e., when the code is revisited through execution. Basic blocks can be grouped according to their runtime behavior and execution paths. A Trace is a set of connected basic blocks in the CFG, and it represents an execution path (see Figure 4.1a). A trace contains one or more conditional branches that may transfer control out of the trace boundaries, namely exits. Exits transfer control back to the program or to another trace. It is possible for a trace to have more than one entry; a trace with a single entry is called a Superblock [96]. Our basic idea is to optimistically split code into parallel traces. For each trace, we create a synthetic method (see Figure 4.1b) that:

• Contains the code of the trace.

• Receives its entry point and any variables accessed by the trace code as input parameters.

• Returns the exit point of the trace (i.e., the point where the function returns).
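As an illustration, a synthetic method generated for a two-block trace might look like the following sketch (Java-level pseudocode with hypothetical names such as Context and the BLOCK_/EXIT_ constants; the actual transformation operates on bytecode):

// Variables accessed by the trace travel in a hypothetical Context object.
class Context { int i, limit, sum; int[] data; }

class SyntheticTraces {
    static final int BLOCK_A = 0, BLOCK_B = 1, EXIT_J = 2, EXIT_H = 3;

    // Synthetic method for a trace over blocks {a, b}, with entries {a, b}
    // and exits {j, h}. The return value tells the dispatcher which exit
    // the control flow took.
    static int trace_ab(int entryPoint, Context ctx) {
        switch (entryPoint) {
            case BLOCK_A:
                ctx.i = 0;                   // body of basic block a
                // falls through into block b, mirroring the a -> b CFG edge
            case BLOCK_B:
                if (ctx.i >= ctx.limit)      // body of basic block b
                    return EXIT_J;           // control leaves the trace toward j
                ctx.sum += ctx.data[ctx.i++];
                return EXIT_H;               // control leaves the trace toward h
            default:
                throw new IllegalStateException("unknown entry " + entryPoint);
        }
    }
}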

Synthetic methods are executed in separate threads as memory transactions, and a TM library is used for managing the contention. We name a call to a synthetic method with a specific set of inputs a job. Multiple jobs can execute the same code (trace), but with different inputs. As concurrent jobs can access and modify the same memory locations, memory must be protected against invalid accesses. To do so, we employ TM to organize access to memory and to preserve memory consistency. Each transaction is mapped to a subset of the job code or spans multiple jobs. While executing, each transaction operates on a private copy of the accessed memory. Upon successful completion of the transaction, all modified variables are exposed to main memory. We define a successful execution of an invoked job as an execution that satisfies the following two conditions:

• It is reachable by future executions of the program; and

• It does not cause a memory conflict with any other job having an older chronological order.

(a) Control Flow Graph with two Traces

(b) Transformed Program

Figure 4.1: Program Reconstruction as Jobs

As we will detail in Section 4.2, any execution of a parallel program produced after our transformations is made of a sequence of jobs committed after a successful execution. In a nutshell, the parallelization targets those blocks of code that are prone to be parallelized and uses the TM abstraction to mark them. Such TM-style transactions are then automatically instrumented to make the parallel execution correct (i.e., equivalent to the execution of the original serial application) even in the presence of data conflicts (e.g., two iterations of one loop activated in parallel and modifying the same part of a shared data structure). Clearly, the presence of more conflicts leads to less parallelism and thus to poor performance.

4.2 Transactional Execution

TM encapsulates optimism: a transaction maintains its read-set and write-set, i.e., the objects read and written during the execution, and at commit time checks for conflicts on shared objects. Two transactions conflict if they access the same object and one access is a write. When this happens, a contention manager [168] solves the conflict by aborting one transaction and allowing the other to proceed to commit, yielding (the illusion of) atomicity. Aborted transactions are restarted, often immediately. Thus, a transaction ends by either committing (i.e., its operations take effect) or by aborting (i.e., its operations have no effect). The atomicity of transactions is mandatory, as it guarantees the consistency of the code even after its refactoring to run in parallel. However, if no additional care is taken, transactions run and commit independently of each other, and that could revert the chronological order of the program, which must be preserved to avoid incorrect executions.

In our model, jobs (which contain transactions) run speculatively, but a transaction, in general, is allowed to commit whenever it finishes. This property is desirable to increase thread utilization and avoid fruitless stalls, but it can lead to transactions corresponding to unreachable code (e.g., a break condition that changes the execution flow), and transactions executing code with an earlier chronological order may read future values from committed transactions that correspond to code with a later chronological order. The following example illustrates this situation.

Consider the example in Figure 4.2, where three jobs A, B, and C are assigned to different threads TA, TB, and TC and execute as three transactions tA, tB, and tC, respectively. Job A can have B or C as its successor, and that cannot be determined until runtime. According to the parallel execution in Figure 4.2(b), TC will finish execution before the others. However, tC will not commit until tA or tB completes successfully. This requires that every transaction notify the STM to permit its successor to commit.

Now, let tA conflict with tB because of an unexpected memory access. The STM will favor the transaction that is older in the original execution order, so it will abort tB and discard its local changes.

Figure 4.2: Parallel execution pitfalls: (a) Control Flow Graph, (b) Possible parallel execution scenario, and (c) Managed TM execution


Figure 4.3: Transaction States

Later, tB will be re-executed. A problem arises if tA and tC wrongly and unexpectedly access the same memory location. Under Figure 4.2(b)'s parallel execution scenario, this will not be detected as a transactional conflict (TC finishes before TA). To handle this scenario, we extend the lifetime of transactions back to the earliest active transaction's starting time: when a transaction must wait for its predecessor to commit, its lifetime is extended till the end of its predecessor. Figure 4.2(c) shows the execution from our managed TM perspective. Although these scenarios are admissible under generic concurrency control (where the order of transactions is not enforced), they clearly violate the logic of the program. To resolve this situation, program order is maintained by deferring the commit of transactions that complete early till their valid execution time.

Motivated by that, we propose an ordered transactional execution model based on the original program's chronological order. Transaction execution works as follows. Transactions have five states: idle, active, completed, committed, and aborted (see Figure 4.3). Initially, a transaction is idle: it is still in the transactional pool, waiting to be attached to a job to dispatch. Each transaction has an age identifier that defines its chronological order in the program. A transaction becomes active when it is attached to a thread and starts its execution. When a transaction finishes its execution, it becomes completed, meaning that it is ready to commit and has completed its execution without conflicting with any other transaction; a transaction in this state still holds all locks on the written addresses. Finally, the transaction is committed when it becomes reachable from its predecessor transaction. Decoupling the completed and committed states permits threads to process the next transactions without waiting for a transaction's valid execution time.
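The commit-ordering logic can be pictured with the following minimal sketch (hypothetical API; ByteSTM realizes the equivalent mechanism inside the VM):

import java.util.concurrent.atomic.AtomicInteger;

// Sketch of age-based ordered commits: a completed transaction is held
// back until every transaction with an earlier age has committed.
class OrderedCommitter {
    private final AtomicInteger nextAgeToCommit = new AtomicInteger(0);

    // Called by a thread whose transaction (with the given age) reached the
    // 'completed' state: validated and still holding its write locks.
    void commitInOrder(int age, Runnable exposeWriteSet) {
        while (nextAgeToCommit.get() != age)
            Thread.onSpinWait();            // predecessor has not committed yet
        exposeWriteSet.run();               // publish the write-set to memory
        nextAgeToCommit.incrementAndGet();  // unblock the successor's commit
    }
}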

4.3 Jikes RVM

The Jikes Research Virtual Machine (Jikes RVM) is an open source implementation of the Java Virtual Machine (JVM). Jikes RVM has a flexible and modular design that supports prototyping, testbeds, and experimental analysis. Unlike most other JVM implementations, which are usually written in native code (e.g., C, C++), Jikes is written in Java. This characteristic provides portability, an object-oriented design, and integration with the running applications. Jikes RVM is divided into the following components:

• Core Runtime Services: This component is responsible for managing the execution of running applications, which includes the following:

– Thread creation, management, and scheduling.
– Loading class definitions and triggering the compiler.
– Handling calls to Java Native Interface (JNI) methods.
– Exception handling and traps.

• Magic: This is a mechanism for handling low-level system-programming operations such as raw memory access, uninterruptible code, and unboxed types. Unlike all other components, this module is not written in pure Java, as it uses machine code to provide this functionality.

• Compilers: These read bytecode and generate efficient machine code executable on the current platform.

• The Memory Management Toolkit (MMTk) handles memory allocation and garbage collection.

• The Adaptive Optimization System (AOS) allows online feedback-directed optimizations. It is responsible for profiling an executing application and triggering the optimizing compiler to improve its performance.

4.4 System Architecture

In HydraVM, we extend the AOS [12] architecture to enable parallelization of input pro- grams, and dynamically refine parallelized sections based on execution. Figure 4.4 shows HydraVM’s architecture, which contains six components:

• Profiler: performs static analysis and adds additional instructions to monitor data access and execution flow.

• Inspector: monitors program execution at runtime and produces profiling data.

• Optimization Compiler: recompiles bytecode at runtime to improve performance and triggers the reloading of class definitions.

• Knowledge Repository: a store for profiling data and execution statistics.

• Builder: uses profiling data to reconstruct the program as multi-threaded code, and tunes execution according to data access conflicts.

• TM Manager: handles transactional concurrency control to guarantee safe memory access and to preserve the execution order.

HydraVM works in three phases. The first phase focuses on detecting parallel patterns in the code by injecting the code with hooks, monitoring code execution, and determining memory access and execution patterns. This may lead to slower code execution due to the inspection overhead. The Profiler is active only during this phase; it analyzes the bytecode and instruments it with additional instructions. The Inspector collects information from the generated instructions and stores it in the Knowledge Repository.

The second phase starts after enough information has been collected in the Knowledge Repository about which blocks were executed and how they access memory. The Builder component uses this information to split the code into traces, which can be executed in parallel. The new version of the code is generated and compiled by the Optimization Compiler component. The TM Manager manages the memory accesses of the parallel version's execution and orders transaction commits according to the original execution order. The manager also collects profiling data, including the commit rate and conflicting threads.

The last phase tunes the reconstructed program based on thread behavior (i.e., the conflict rate). The Builder re-evaluates the previous reconstruction of traces by splitting or merging some of them and reassigning them to threads. The last two phases work in an alternating manner till the end of program execution, as the second phase provides feedback to the third.

HydraVM supports two modes: online and offline. In the online mode, we assume that program execution is long enough to capture parallel execution patterns. Otherwise, the

Figure 4.4: HydraVM Architecture

first phase can be done in a separate pre-execution phase, which can be classified as offline mode. We now describe each of HydraVM’s components.

4.4.1 Bytecode Profiling

To collect this information, we modify Jikes RVM’s baseline compiler to insert additional instructions (in the program bytecode) at the edges of selected basic blocks (e.g., branching, conditional, return statements) that detect whenever a basic block is reached. Additionally, we insert instructions into the bytecode to:

• Statically detect the set of variables accessed by the basic blocks, and

• Mark basic blocks with irrevocable calls (e.g., input/output operations), as they need special handling in program reconstruction.

This code modification does not affect the behavior of the original program. We call this version of the modified program the profiled bytecode.

4.4.2 Trace detection

With the profiled bytecode, we can view the program execution as a graph, with basic blocks and variables represented as nodes and the execution flow as edges. A basic block that is visited more than once during execution is represented by a different node each time (see Figure 4.5b). The benefits of the execution graph are multifold:

• Hot-spot portions of the code can be identified by examining the graph's hot paths,

• Static data dependencies between blocks can be determined, and

• Parallel execution patterns of the program can be identified.

To determine traces, we use a string factorization technique: each basic block is represented by a character that acts as a unique ID for that block. An execution of a program can then be represented as a string. For example, Figure 4.5a shows a matrix multiplication code snippet. An execution of this code for 2x2 matrices can be represented by the execution graph shown in Figure 4.5b, or as the string abjbhcfefghcfefghijbhcfefghcfefghijk. We factorize this string into its basic components using a variant of Main's algorithm [113]. The factorization converts the matrix multiplication string into ab(jb(hcfefg)2hi)2jk. Using this representation, combined with grouping blocks that access the same memory locations, we divide the code into multiple sets of basic blocks, namely traces (see Figure 4.5c). In our example, we detected three traces:

for (Integer i = 0; i < DIMx; i++)
    for (Integer j = 0; j < DIMx; j++)
        for (Integer k = 0; k < DIMy; k++)
            X[i][j] += A[i][k] * B[k][j];

(a) Matrix Multiplication Code

(b) 2x2 Matrix Multiplication Execution Graph with Traces

(c) Control Flow Graph with Traces

Figure 4.5: Matrix Multiplication

1. Trace ab, with two entries (to a and to b) and two exits (to j and to h);

2. Trace jk, with a single entry (to j) and two exits (to z and to b); and

3. Trace hcfefg, with two entries (to h and to e) and three exits (to h, to e, and to i). In this trace, the innermost loop was unrolled, so each trace execution represents two iterations of the innermost loop. This is reflected in Figure 4.5c by adding an extra node f. Note that the transition from g to h is represented by an exit and an entry, not as an internal transition within the trace. This difference enables running multiple jobs concurrently executing the same trace code.

Thus, we divide the code, optimistically, into independent parts called traces that represent subsets of the execution graph. Each trace does not overlap with other traces in the variables it accesses, and it represents a long sequence of instructions, including branch statements, that commonly execute in this pattern. Since a branch instruction has taken and not-taken paths, a trace may contain one or both of the two paths according to the frequency with which those paths are used. For example, with biased branches, one of the paths is taken most of the time, so it is included in the trace, leaving the other path outside the trace. On the other hand, with unbiased branches, both paths may be included in the trace. Therefore, a trace has multiple exits, according to the program control flow during its execution. A trace also has multiple entries, since a jump or a branch instruction may target any of the basic blocks that construct it. The Builder module orchestrates the construction of traces and distributes them over parallel threads. However, this may potentially lead to an out-of-order execution of the code, which we address through STM concurrency control (see Section 4.2). I/O instructions are excluded from parallel traces, as changing their execution order affects the program semantics, and they are irrevocable (i.e., they cannot be undone when a transaction aborts).
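To give a flavor of the factorization step, the following greedy sketch (a simplified stand-in for Main's algorithm [113], with illustrative names) detects immediately repeated factors in an execution string:

// Greedy tandem-repeat detection over an execution string in which each
// character is a basic-block ID; it rewrites "ww...w" as "(w)^k".
class TraceFactorizer {
    static String factorize(String exec) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < exec.length()) {
            boolean matched = false;
            // Prefer the longest factor so whole loop bodies are captured first.
            for (int len = (exec.length() - i) / 2; len >= 1 && !matched; len--) {
                String factor = exec.substring(i, i + len);
                int reps = 1;
                while (i + (reps + 1) * len <= exec.length()
                        && exec.startsWith(factor, i + reps * len))
                    reps++;
                if (reps > 1) {                  // found factor^reps at position i
                    out.append('(').append(factor).append(")^").append(reps);
                    i += reps * len;
                    matched = true;
                }
            }
            if (!matched) out.append(exec.charAt(i++)); // no repetition: copy block ID
        }
        return out.toString();
    }
}

Applied to the matrix multiplication string above, this sketch yields ab(jbhcfefghcfefghi)^2jk; applying it recursively to the factor body also exposes the inner repetition, giving the nested form ab(jb(hcfefg)^2hi)^2jk reported above.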

4.4.3 Parallel Traces

Upon detection of a candidate trace for parallelization, the program is reconstructed into a producer-consumer pattern. In this pattern, two daemon threads are active, a producer and a consumer, which share a common fixed-size queue of jobs. Recall that a job represents a call to a synthetic method executing the trace code with a specific set of inputs. The producer generates jobs and adds them to the queue, while the consumer dequeues the jobs and executes them. HydraVM uses a Collector module and an Executor module to process the jobs: the Collector has access to the generated traces and uses them as jobs, while the Executor executes the jobs by assigning them to a pool of core threads. Figure 4.6 shows the overall pattern of the generated program. Under this pattern, we utilize the available cores by executing jobs in parallel. However, doing so requires handling the following issues:

• Threads may finish out of the original execution order.

Figure 4.6: Program Reconstruction as a Producer-Consumer Pattern

• The execution flow may change at runtime, causing some of the assigned traces to be excluded from the correct execution.

• Due to differences between the execution flow observed in the profiling phase and the actual execution, memory access conflicts between concurrent accesses may occur. Also, memory arithmetic (e.g., arrays indexed with variables) may easily violate the program reconstruction (see the example in Section 4.4.5).

To tackle these problems, we execute each job as a transaction. A transaction's changes are deferred until commit. At commit time, a transaction commits its changes if and only if: 1) it did not conflict with any other concurrent transaction, and 2) it is reachable under the actual execution flow.
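A minimal sketch of this commit rule (illustrative only; HydraVM enforces it inside ByteSTM, described in Section 4.5.4):

    #include <functional>
    #include <iostream>

    // Hypothetical job descriptor: the STM sets `conflicted` on a detected
    // read/write overlap, and the Executor sets `reachable` according to
    // the actual control flow; `write_back` applies the deferred changes.
    struct Job {
        bool conflicted;
        bool reachable;
        std::function<void()> write_back;
    };

    bool try_commit(const Job& job) {
        if (job.conflicted) return false;   // condition 1 violated: retry
        if (!job.reachable) return false;   // condition 2 violated: discard
        job.write_back();                   // expose the buffered changes
        return true;
    }

    int main() {
        int x = 0;
        Job job{false, true, [&] { x = 42; }};
        std::cout << try_commit(job) << " x=" << x << "\n";  // prints: 1 x=42
    }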

4.4.4 Reconstruction Tuning

TM preserves data consistency, but it may cause degraded performance due to successive conflicts. To reduce this, the TM Manager provides feedback to the Builder component so that it can reduce the number of conflicts. We store the commit rate and the conflicting scenarios in the Knowledge Repository to be used later for further reconstruction. When the commit rate falls below a preconfigured minimum, the Builder is invoked. Conflicting traces are combined into a single trace. This requires changes to the control instructions (e.g., branching conditions) to maintain the original execution flow. The newly reconstructed version is recompiled and loaded as a new class definition at runtime.

Figure 4.7: 3x3 Matrix Multiplication Execution using Traces Generated by Profiling 2x2 Matrix Multiplication

4.4.5 Misprofiling

Profiling depends mainly on the program input. This input may not reflect some runtime aspects of the program flow (e.g., loop limits, biased branches). To illustrate this, we return to the matrix multiplication example in Figure 4.5a. Based on the profiling using 2x2 matrices, we construct the execution graph shown in Figure 4.5b. Now, recall our three traces ab, hcfefg, and jk, and assume we need to run this code for matrices 2x3 and 3x2. The Collector will assign jobs to the Executor, but upon the execution of the trace jk, the Executor will find that the code exits after j and needs to execute b. Hence, it will request the Collector to schedule the job ab, with an entry to basic block b, in the incoming job set. Doing so allows us to extend the flow to cover more iterations. Note that the entry point must be sent to the synthetic method that represents the trace, as it should be able to start from any of its basic blocks (e.g., ab will start from b, not a). In Figure 4.7, traces are represented by blocks with their entry points on the left side and their exits on the right. The figure describes the execution using the traces extracted by profiling the 2x2 matrix multiplication (see the example in Section 4.4.1).

4.5 Implementation

4.5.1 Detecting Real Memory Dependencies

Recall that we use bytecode as the input, and concurrency refactoring is done entirely at the VM level. Compiler optimizations, such as register reductions and variable substitutions, increase the difficulty of detecting memory dependencies at the bytecode level. For example,

two independent basic blocks in the source code may share the same set of local variables or loop counters in the bytecode. To overcome this problem, we transform the bytecode into the Static Single Assignment (SSA) form [24]. The SSA form guarantees that each local variable has a single static point of definition and is assigned exactly once, which significantly simplifies analysis. Figure 4.8 shows an example of the SSA form (the original code on the left, its SSA form on the right).

    y = 1        y1 = 1
    y += 2       y2 = y1 + 2
    x = y / 2    x1 = y2 / 2

Figure 4.8: Static Single Assignment Form Example

Using the SSA form, we inspect assignment statements, which reflect the memory operations required by each basic block. At the end of each basic block, we generate a call to a hydra touch operation that notifies the VM about the variables that were accessed in that basic block. In the second phase of profiling, we record the execution paths and the memory accessed by those paths. We then package each set of basic blocks into a trace. Traces should not conflict with each other by accessing the same memory objects. However, it is possible to have such conflicts, since our analysis uses information from past executions (which could differ from the current execution). We intentionally designed the data dependency algorithm to ignore some questionable data dependencies (e.g., loop indices). This gives more opportunities for parallelization: if a questionable dependency occurs at runtime, then TM will detect and handle it; otherwise, such blocks run in parallel and greater speedup is achieved.

4.5.2 Handling Irrevocable Code

Input and output instructions must be handled as a special case in the reconstruction and parallel execution as they cannot be rolled back. Traces with I/O instructions are therefore marked for special handling. The Collector never schedules such marked traces unless they are reachable – i.e., they cannot be run in parallel with their preceding traces. However, they can be run in parallel with their successor traces. This implicitly ensures that at most one I/O trace executes (i.e., only a single job of this trace runs at a time).

4.5.3 Method Inlining

Method inlining is the insertion of the complete body of a method at every place where it is called. In HydraVM, method calls appear as basic blocks, and in the execution graph, they appear as nodes. Thus, inlining occurs automatically as a side effect of the reconstruction process. This eliminates the overhead of invoking a method. Another interesting issue is handling recursive calls. The execution graph for recursion will appear as a repeated sequence of basic blocks (e.g., abababab . . .). Similar to method inlining, we merge multiple levels of recursion into a single trace, which reduces the overhead of managing parameters over the heap. Thus, a recursive call under HydraVM is formed as nested transactions with a lower depth than the original recursive code.

4.5.4 ByteSTM

ByteSTM [125] is an STM that operates at the bytecode level, which yields the following benefits:

• Significant implementation flexibility in handling memory access at low-level (e.g., registers, thread stack) and for transparently manipulating bytecode instructions for transactional synchronization and recovery;

• Higher performance due to implementing all TM building blocks (e.g., versioning, conflict detection, contention management) at bytecode-level; and

• Easy integration with other modules of HydraVM (Section 4.4)

We modified the Jikes RVM to support TM by adding two instructions, xBegin and xCommit, which are used to start and end a transaction, respectively. Each load and store inside a transaction is done transactionally: loads are recorded in a read signature, while stores are sandboxed in a transaction-local storage called the write-set, and the address of any written variable (accessible at the VM level) is added to the write signature. The read/write signatures are represented using Bloom filters [25] and are used to detect read/write or write/write conflicts. This approach is more efficient than comparing the read-sets and write-sets of transactions, but it can introduce false positives; with a correctly chosen signature size, the effect of false positives can be reduced, and we do this. When a load is called inside a transaction, we first check the write-set to determine whether this location has been written before and, if so, the value from the write-set is returned. Otherwise, the value is read from memory and the address is added to the read signature. At commit time, the read and write signatures of concurrent transactions are compared and, if there is a conflict, the newer transaction is aborted and restarted. If the validation shows no conflict, then the write-set is written to memory. For a VM-level STM, greater optimizations are possible than for non-VM-level STMs (e.g., Deuce [101], DSTM2 [91]). At the VM level, data types do not matter; only their sizes do. This allows us to simplify the data structures used to handle transactions: one- through eight-byte data types are handled in the same way. Similarly, all forms of data addressing are reduced to absolute addressing. Primitives, objects, array elements, and statics are handled differently inside the VM, but they are all translated into an absolute address and a specific size in bytes. This simplifies and speeds up the write-back process, since we only care about writing back some bytes at a specific address. It also allows us to work at the field level and at the array-element level, which significantly reduces conflicts: if two transactions use the same object, but each uses a different field inside the object, then no conflict occurs (and similarly for arrays). Another optimization is the ability to avoid the VM's Garbage Collector (GC). GC can reduce STM performance when attempting to free unused objects; also, dynamically allocating new memory for the STM is costly. ByteSTM therefore disables the GC for the memory used by the internal data structures that support the STM: we statically allocate memory for the STM, handle it without interruption from the GC, and manually recycle it. New memory is allocated only upon memory overflow. Note that if a hybrid TM is implemented in Java, then it must be implemented inside the VM; otherwise, the hybrid TM will violate the invariants of internal data structures used inside the VM, leading to inconsistencies. We also inline the STM code inside the load and store instructions and the newly added xBegin and xCommit instructions; thus, there is no overhead in calling the STM procedures in ByteSTM. Each job has an order that represents its logical position in the sequential execution of the original program. To preserve data consistency between jobs, the STM must be modified to support this ordering. Thus, in ByteSTM, when a conflict is detected between two jobs, we abort the one with the higher order.
Also, when a job with a higher order tries to commit, we force it to sleep until its order is reached; ByteSTM commits the job only if no conflict is detected. When attempting to commit, each transaction checks its order against the expected order. If they are the same, the transaction proceeds, validates its read-set, commits its write-set, and updates the expected order; otherwise, it sleeps and waits for its turn. The validation is done by scanning the thread stack and registers and collecting the accessed objects' addresses. Object IDs are retrieved from the object copies and used to create a transaction signature, which represents the memory addresses accessed by the transaction. Transactional conflicts are detected using the intersection of transaction signatures. After committing, each thread checks whether the next thread is waiting for its turn to commit and, if so, that thread is woken up. Thus, ByteSTM keeps track of the expected order and handles commit in a decentralized manner.
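The following is a minimal sketch of such an age-ordered commit (illustrative only: it uses a shared counter and a condition variable, whereas ByteSTM tracks the expected order and wakes the successor thread directly inside the VM):

    #include <atomic>
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::atomic<long> expected_order{0};   // age allowed to commit next
    std::mutex m;
    std::condition_variable turn;

    void ordered_commit(long my_age) {
        std::unique_lock<std::mutex> lk(m);
        // Sleep until every transaction with a lower age has committed.
        turn.wait(lk, [&] { return expected_order.load() == my_age; });
        // ... validate the read signature and write back the write-set ...
        std::cout << "commit tx " << my_age << "\n";
        expected_order.fetch_add(1);       // delegate the committer role
        turn.notify_all();                 // wake the next transaction
    }

    int main() {
        std::vector<std::thread> pool;
        for (long age = 3; age >= 0; --age)          // start out of order
            pool.emplace_back(ordered_commit, age);  // commits stay in order
        for (auto& t : pool) t.join();
    }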

4.5.5 Parallelizing Nested Loops

Nested loops introduce a challenge for parallelization, as it is difficult to parallelize both inner and outer loops, and doing so imposes complexity on the system design. In HydraVM, we handle nested loops as nested transactions using the closed-nesting model [132]: aborting a parent transaction aborts all its inner transactions, but not vice versa, and changes made by inner transactions become visible to their parents when they commit, but those changes are hidden from the outside world until the highest-level parent commits. Our closed-nesting model has the following properties:

Figure 4.9: Nested Traces

1. Inner transactions share the read-set/write-set of their parent transactions;

2. Changes made by inner transactions become visible to their parent transactions when the inner transactions commit, but they are hidden from the outside world until the commit of the highest-level parent;

3. Aborting an inner transaction aborts only its changes (not other sibling transactions, and not its parent transaction);

4. Aborting a parent transaction aborts all its inner transactions;

5. Inner transactions may conflict with each other and also with other, non-parent, higher-level transactions; and

6. Inner transactions are mutually ordered; i.e., their ages are relative to the first inner transaction of their parent. When an inner transaction conflicts with another inner transaction of a different parent, the ages of parent transactions are compared.

The rationale for using closed nesting, instead of other models such as linear nesting [132], is twofold:

1. Inner transactions run concurrently; conflicts are resolved by aborting the transaction with the higher age.

2. We leverage the partial rollback of inner transactions without affecting the parent or sibling transactions.
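A minimal sketch of these closed-nesting rules (illustrative names; ByteSTM implements the equivalent logic over its signatures and write-sets):

    #include <cstdint>
    #include <unordered_map>

    // A write-set maps written addresses to buffered values.
    using WriteSet = std::unordered_map<uintptr_t, uint64_t>;

    // Committing an inner transaction merges its writes into the parent's
    // write-set: visible to the parent, still hidden from the outside world
    // until the highest-level parent commits.
    void commit_inner(WriteSet& parent, WriteSet& inner) {
        for (const auto& [addr, val] : inner) parent[addr] = val;
        inner.clear();
    }

    // Aborting an inner transaction discards only its own changes: a partial
    // rollback that leaves the parent and sibling transactions intact.
    void abort_inner(WriteSet& inner) { inner.clear(); }

    int main() {
        WriteSet parent, inner;
        inner[0x1000] = 7;
        commit_inner(parent, inner);
        return parent.size() == 1 ? 0 : 1;
    }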

Although the open nesting model [131] allows further increases in concurrency, it contradicts our ordering model. In open nesting, inner transactions are permitted to commit

    Processor      AMD Opteron Processor
    CPU Cores      8
    Clock Speed    800 MHz
    L1 Cache       64 KB
    L2 Cache       512 KB
    L3 Cache       5 MB
    Memory         12 GB
    OS             Ubuntu 10.04, Linux

Table 4.1: Testbed and platform parameters for HydraVM experiments.

before their parent transaction and, if the parent transaction is aborted, it uses a compensating action to revert the changes done by its committed inner transactions. However, in our model, allowing inner transactions to commit would violate the ordering rule (lower-age transactions commit first), and preventing them from committing (i.e., forcing them to wait for their chronological order) would defeat the purpose of open nesting. Consider our earlier matrix multiplication example (see Section 4.4). From the execution string, ab(jb(hcfefg)^2hi)^2jk, we can create two nested traces: an outer trace jb(hcfefg)^2hi, and an inner trace hcfefg (see Figure 4.9). The outer trace runs within a transaction, executing jbhi, that invokes a set of inner transactions hcfefg after the execution of the basic block b.

4.6 Experimental Evaluation

Benchmarks. To evaluate HydraVM, we used five applications as benchmarks. These include a matrix multiplication application and four applications from the JOlden benchmark suite [34]: Minimum Spanning Tree (MST), tree add (TreeAdd), Traveling Salesman Problem (TSP), and Bitonic Sort (BiSort). The applications are written as sequential applications, though they exhibit data-level parallelism.

Testbed. We conducted our experiments on an 8-core multicore machine. Each core is an 800 MHz AMD Opteron processor, with 64 KB of L1 data cache, 512 KB of L2 data cache, and 5 MB of L3 data cache. The machine ran Ubuntu Linux.

Evaluation. Table 4.2 shows the results of the Profiler analysis on the benchmarks. The table shows the number of basic blocks, the number of traces, and the average number of instructions per basic block. The lower part of the table shows the number of jobs executed by the Executor, and the maximum level of nesting during the experiments. Using our techniques, we manage to split the sequential implementations of the benchmarks into parallel jobs that exploit the data-level parallelism.

Table 4.2: Profiler Analysis on Benchmarks

    Benchmark              Matrix   TSP    BiSort   MST     TreeAdd
    Avg. Instr. per BB     4.29     4.2    4.75     3.7     4.1
    Basic Blocks           31       77     24       52      10
    Traces                 3        12     5        3       4
    Jobs                   1001     1365   1023     12241   8195
    Max Nesting            2        5      2        1       3

Figure 4.10: HydraVM Speedup (speedup over sequential for 2, 4, 6, and 8 processors on Matrix, TSP, BiSort, MST, and TreeAdd)

Figure 4.10 shows the speedup obtained for different numbers of processors. For matrix multiplication, HydraVM reconstructs the two outer loops into nested transactions, while the innermost loop forms a trace because of its iteration dependencies. In TSP, BiSort, and TreeAdd, multiple levels of the recursive calls are inlined into a single trace. For the MST benchmark, each iteration over the graph adds a new node to the MST, which creates inter-dependencies between iterations. However, updating the costs between the constructed MST and the remaining nodes presents a good parallelization opportunity for HydraVM.

4.7 Discussion

We presented HydraVM, a JVM that automatically refactors concurrency in Java programs at the bytecode level. HydraVM extracts control-level parallelism by reconstructing the program as independent traces. Loops, as a special case of traces, are included in the reconstruction procedure, which supports data-level parallelism. Although our approach targets code traces in general, most applications spend most of their time executing loops. Loops have interesting features such as symmetric code blocks, data-level parallelism, recurrences, and induction variables. In the next chapter, we focus on loops as the unit of parallelization. Online profiling and adaptive compilation provide transparent execution of programs. Nevertheless, they add overhead to the runtime. Such overhead is non-negligible for short-lived applications, or for ones with unpredictable execution paths. Static analysis of the code can be beneficial for providing an initial guess of the hot-spot regions of the code and for building dependency relations between code blocks. Adding a pre-execution static analysis and employing data dependency analysis can enhance code generation and reduce transactional overhead; we follow this approach in Lerna.

Chapter 5

Lerna

In this chapter, we present Lerna, a system that automatically and transparently detects and extracts parallelism from sequential code. Lerna is cross-platform and independent of the programming language, and it does not require analyzing the application's source code: it simply takes as input the application's intermediate representation, compiled using LLVM [108], the well-known compiler infrastructure, and produces ready-to-run parallel code as output1, thus finding its best fit with (legacy) sequential applications. This approach also makes Lerna independent of the specific hardware used. Similar to HydraVM, the parallel execution exploits memory transactions to manage concurrent and out-of-order memory accesses. While HydraVM presents an initial concept of a virtual machine that exploits in-memory transactions to parallelize traces, Lerna focuses on parallelizing loops, which usually contain the hotspot sections of many applications. Recall that HydraVM reconstructs the code at runtime through recompilation and the reloading of class definitions, and it is obligated to run the application through the virtual machine. Unlike HydraVM, Lerna does not change the code at runtime through recompilation, which enables us to compile the code to its native representation. Nevertheless, Lerna supports the adaptive execution of transformed programs by changing some key performance parameters of its runtime library (see Section 5.8). Finally, the extensive profiling phase of HydraVM relies on establishing a relation between basic blocks and their accessed memory addresses, which limits its usage to small-sized applications. In Lerna, we replaced that with a static alias analysis phase, combined with a light offline profiling step that complements our static analysis. Lerna is a complete system, which overcomes all the above limitations and embeds system and algorithmic innovations to provide high performance. Our experimental study involves the transformation of 10 sequential applications into parallel ones. The results show an average speedup of 2.7× for the micro-benchmarks and 2.5× for the macro-benchmarks.

1Lerna supports the generation of native executable as output.


5.1 Challenges

Despite the high-level goal described above, without fine-grained optimizations and innovations, deploying TM-style transactions over blocks of code that are prone to be parallelized leads the application performance to be slower (often much slower) than the sequential, non-instrumented execution. As an example, a blind parallelization of a loop (Figure 5.1a) would mean wrapping the whole body of the loop within a transaction (Figure 5.1b). By doing so, we let all transactions conflict with each other on, at least, the increment of the variable c or i2. In addition: variables that are never modified within the loop may be transactionally accessed; the transaction commit order must match the completion order the iterations would have had if executed sequentially; and aborts can be costly, as they involve retrying the whole transaction, including the local processing work. The combination of these factors nullifies any possible gain from parallelization, thus letting the application pay just the overhead of the transactional instrumentation and, as a consequence, providing performance slower than the sequential execution. Lerna does not suffer from the above issues (as shown in Figure 5.1c). It instruments a small subset of code instructions, which is enough to preserve correctness, and optimizes the processing by a mix of static optimizations and dynamic tuning. The former includes: loop simplification, induction variable reduction, removal of non-transactional work from the context to restore after a conflict, exploitation of the symmetry of the executed parallel code, and an optimized in-order commit of transactions. Regarding the latter, Lerna provides an adaptive runtime layer to improve the performance of the parallel execution. This layer is fine-tuned by collecting feedback from the actual execution in order to capture the best settings of the key performance parameters that most influence the effectiveness of the parallelization (e.g., the number of worker threads, and the size and number of parallel jobs). The results are impressive. We evaluated Lerna's performance using a set of 10 applications including micro-benchmarks from the RSTM [2] framework, STAMP [37], a suite of sequential applications designed for evaluating in-memory concurrency controls, and Fluidanimate [133], an application performing physics simulations. The reason we selected them is that they also provide (except for Fluidanimate) a performance upper bound for Lerna: in fact, they are released with a version that provides synchronization using manually defined, and optimized, transactions. This way, besides the speedup over the sequential implementation, we can show the performance of Lerna against the same application with an efficient, hand-crafted solution. Lerna is on average 2.7× faster than the sequential version using micro-benchmarks (with a peak of 3.9×), and 2.5× faster considering macro-benchmarks (with a top speedup of one order of magnitude reached with STAMP). Lerna is a self-contained, completely automated, and transparent system that makes sequential applications parallel. Thanks to an efficient use of transactional blocks, it finds its sweet spot with applications involving data sharing, and not simply those with partitioned memory access.

2Libraries that require programmer interaction, such as OpenMP, already offer programming primitives to handle loop increments.

(a) Loop with data dependency:

    c = min;
    while (i < max) {
        i++;
        c = c + 5;
        ... local processing work ...
        if (i < j)
            k = k + c;
    }

(b) Loop with atomic body:

    c = min;
    while (i < max) {
        atomic {
            TX_WRITE(i, TX_READ(i) + 1);
            TX_WRITE(c, TX_READ(c) + 5);
            ... local processing work ...
            if (TX_READ(i) < TX_READ(j))
                TX_WRITE(k, TX_READ(k) + TX_READ(c));
        }
    }

(c) Loop with parallelized body:

    c = min;
    while (i < max) {
        i++;
        parallel(i) {
            c = min + i * 5;
            ... local processing work ...
            atomic {
                if (i < j)
                    TX_INCREMENT(k, c);
            }
        }
    }

Figure 5.1: Lerna's Loop Transformation: from Sequential to Parallel.

Figure 5.2: LLVM Three-Layer Design (language front-ends such as Clang for C/C++/Objective-C, VMKit for Java, llvm-gcc for Fortran, GHC for Haskell, and the CUDA and Cloudera Impala SQL front-ends produce the LLVM Intermediate Representation, on which the optimizations run; back-ends then generate code for targets such as x86, ARM, AMD, and IBM PowerPC)

5.2 Low-Level Virtual Machine

Low-Level Virtual Machine (LLVM) is a modular and reusable collection of libraries, with well-defined interfaces, that defines a complete compiler infrastructure. It represents a middle layer between front-end language-specific compilers and back-end instruction-set generators. LLVM offers an Intermediate Representation (IR), the bytecode, that can be optimized and transformed into more efficient IR. Optimizers can be chained to provide multiple levels of optimization at different scopes (module, call graph, function, loop, region, and basic block). LLVM supports a language-independent instruction set and data types in Static Single Assignment (SSA) form. This separation of layers helps developers write front-ends that make use of the underlying compiler optimizations. At the current stage, LLVM supports most of the widely used languages, such as C, C++, Objective-C, Java, Fortran, Haskell, and Ruby (the full list is at [3]). The most common LLVM front-end is Clang, which supports compiling C, Objective-C, and C++ code; Clang is supported by Apple as a replacement for the C/Objective-C compiler in the GCC system. Likewise, several platforms, including x86/x86-64, AMD, ARM, SPARC, MIPS, and Nvidia, have LLVM back-end generators for their instruction sets (see Figure 5.2). Together, these factors build a large community around LLVM as a popular compiler infrastructure. With LLVM, researchers build their optimizations on the intermediate layer using the LLVM bytecode, and thereby make their optimizations available for a large set of languages and platforms.

5.3 General Architecture and Workflow

Lerna is deployed as a container of:

• an automated software tool that performs a set of transformations and analysis steps (also called passes in accordance with the terminology used by LLVM) that run on the LLVM intermediate representation of the original binary application code. It produces a refactored multi-threaded version of the input program that can run efficiently on multiprocessor architectures;

• a runtime library that is linked dynamically to the generated program, and is responsible for: 1) organizing the transactional execution of dispatched jobs so that the original program order (i.e., the chronological order) is preserved; 2) selecting (adaptively) the most effective number of worker threads according to the actual setup of the runtime environment, based on the feedback collected from the online execution; 3) scheduling jobs to threads according to the threads' characteristics (e.g., stack size, priority); and 4) performing memory housekeeping and releasing computational resources.

Figure 5.3 shows the architecture and the workflow of Lerna. Lerna operates on the LLVM intermediate representation of the input program, thus it does not require the application to be written in any specific programming language. However, Lerna's design does not preclude the programmer from providing hints that can be leveraged to make the refactoring process more effective. In this work we present the fully automated process, without any programmer intervention, and we discuss how to exploit prior application knowledge in Section 5.8. Lerna's workflow includes the following three steps, in this order: Code Profiling, Static Analysis, and Runtime. In the first step, our software tool executes the original (sequential) application by activating our own profiler, which collects some important parameters (e.g., execution frequencies) later used by the Static Analysis. The goal of the Static Analysis is to produce a multi-threaded (also called reconstructed) version of the input program. This process proceeds through the following passes:

• Dictionary Pass. It scans the input program to provide a list of the accessible functions (i.e., those that are neither system calls nor native-library calls) of the bytecode (or bitcode, as it is named by LLVM) that we can analyze to determine how to transform the code. By default, any call to an external function is flagged as unsafe. This information is important because transactions cannot contain unsafe calls, as they may include irrevocable operations (i.e., operations that cannot be rolled back once performed), such as I/O system calls.

Figure 5.3: Lerna's Architecture and Workflow (the profiling pass produces the profiled bytecode; the static analysis applies the Dictionary, Builder, and Transactifier passes to produce the reconstructed multi-threaded program; the Lerna runtime links the Executor, the Workers Manager, the jobs queue, the TM algorithm, and the knowledge base)

• Builder Pass. It detects the code eligible for parallelization; it transforms this code into a callable synthetic method; and it defines the transaction’s boundaries (i.e., where the transaction begins and ends).

• Transactifier Pass. It applies the alias analysis [47] (i.e., it detects if multiple references point to the same memory location) and some memory dependency techniques (e.g., given a memory operation, extracts the preceding memory operations that depend on it) to reduce the number of transactional reads and writes. It also provides the instrumentation of memory operations invoked within the body of a transaction by wrapping them into transactional calls for read, write or allocate.

Once the Static Analysis is complete, the reconstructed version of the program is linked to the application through the Lerna runtime library, which is mainly composed of the following three components:

• Executor. It dispatches the parallel jobs and delivers the exit of the last job to the program. To exploit parallelism, the executor dispatches multiple jobs at a time by grouping them into a batch. Once a batch is complete, the executor simply waits for the result of this batch. Not all jobs are enclosed in a single batch, so the executor may need to dispatch more jobs after the completion of the previous batch. If no more jobs need to be dispatched, the executor finalizes the execution of the parallel section.

• Workers Manager. It extracts jobs from a batch and delivers ready-to-run transactions to the available worker threads.

• TM. It provides the handlers for transactional accesses (read and write) performed by executing jobs. In case a conflict is detected, it also behaves as a contention manager by aborting the conflicting transactions with the higher chronological order (this way the original program’s order is respected). Also, it handles the garbage collection of the memory allocated by a transaction, after it completes.

The runtime library makes use of two additional components: the jobs queue, which stores the (batches of) dispatched jobs until they are executed; and the knowledge base, which maintains the feedback collected from the execution in order to enable the adaptive behavior.

5.4 Code Profiling

Lerna uses code profiling to identify the hotspot sections of the original code, namely those most visited during the execution. This information is fundamental for letting the refactoring process focus on the parts of the code that are fruitful to parallelize (e.g., it would not be effective to parallelize a for-loop with only two iterations). To do that, we consider the program as a set of basic blocks, where each basic block is a sequence of non-branching instructions that ends either with a branch instruction (conditional or non-conditional) or a return. Given that, any program can be represented as a graph in which nodes are basic blocks and edges reproduce the program control flow (an example of such a graph is shown in Figure 5.5). Basic blocks are easily determined from the bytecode (see Figure 5.4). In this phase, our goal is to identify the context, frequency, and reachability of each basic block. To determine this information, we profile the input program by instrumenting its bytecode at the boundaries of each basic block, so as to detect whenever a basic block is reached. This code modification does not affect the behavior of the original program. We call this version of the modified program the profiled bytecode.

5.5 Program Reconstruction

In the following, we illustrate in detail the transformation from sequential to parallel code made during the static analysis phase. The LLVM intermediate representation (i.e., the bytecode) is in the static single assignment (SSA) form. With SSA, each variable is defined before it is used and is assigned exactly once. Figure 5.4 shows the LLVM intermediate representation of the loop of Figure 5.1a. The code in Figure 5.4 is in SSA form and is divided into a set of basic blocks. Each basic block starts with an optional label, is composed of LLVM assembly instructions, and ends with a branching instruction (terminator).

5.5.1 Dictionary Pass

In the dictionary pass, a full bytecode scan is performed to determine the list of accessible code (i.e., the dictionary) and, as a consequence, the external calls. Any call to an external function that is not included in the input program prevents the enclosing basic block from being included in the parallel code. However, the user can override this rule by providing a list of safe external calls. An external call is defined as safe if:

• It is revocable (e.g., it does not perform input/output operations);

• It does not affect the state of the program; and

    entry:
      %retval = alloca i32, align 4
      br label %while.cond

    while.cond:                               ; preds = %if.end, %entry
      %0 = load i32* %i, align 4
      %1 = load i32* %max, align 4
      %cmp = icmp slt i32 %0, %1
      br i1 %cmp, label %while.body, label %while.end

    while.body:                               ; preds = %while.cond
      %2 = load i32* %i, align 4
      %inc = add nsw i32 %2, 1
      store i32 %inc, i32* %i, align 4
      call void @_Z19do_local_processingv()
      %3 = load i32* %i, align 4
      %4 = load i32* %j, align 4
      %cmp1 = icmp slt i32 %3, %4
      br i1 %cmp1, label %if.then, label %if.end

    if.then:                                  ; preds = %while.body
      %5 = load i32* %k, align 4
      %add = add nsw i32 %5, 1
      store i32 %add, i32* %k, align 4
      br label %if.end

    if.end:                                   ; preds = %if.then, %while.body
      br label %while.cond

    while.end:                                ; preds = %while.cond
      %6 = load i32* %retval
      ret i32 %6

Figure 5.4: The LLVM Intermediate Representation, in SSA form, of Figure 5.1a.

• It is thread safe.

Common examples of safe calls are stateless random number generators and basic mathematical functions, such as trigonometric functions.

5.5.2 Builder Pass

This pass is one of the core steps of the refactoring process because it takes the code to transform (as output of the profiling phase) and makes it parallel by matching it against the outcome of the dictionary pass. In fact, if the profiler highlights a frequently invoked basic block that contains calls not in the dictionary, then the parallelization cannot be performed on that basic block. In this thesis we focus on loops as the most appropriate blocks of code for parallelization. However, our design is applicable (unless stated otherwise) to any independent set of basic blocks. The actual operation of building the parallel code takes place after the following two transformations.

• Loop Simplification analysis. A natural loop has one entry block, the header, and one or more back edges (latches) leading to the header. The predecessor blocks of the loop header are called pre-header blocks. We say that a basic block α dominates another basic block β if every path in the code to β goes through α. The body of the loop is the set of basic blocks that are dominated by its header and reachable from its latches. The exits are basic blocks that jump to a basic block not included in the loop body. In Figure 5.4, the entry block is the loop pre-header, while its header is while.cond. The loop has one latch (i.e., if.end) and a single exit, while.end (from while.cond). The loop body is the set of blocks while.body, if.then, and if.end. A simple loop is a natural loop with a single pre-header and a single latch, whose index (if it exists) starts from zero and increments by one. We apply loop simplification to put the loop into its simplest form. Examples of natural and simple loops are reported in Figures 5.5 (a) and (b), respectively. In Figure 5.5 (a), the loop header has two types of predecessors: external basic blocks from outside of the loop, and one of the body latches. Putting this loop in its simple form requires adding: i) a single pre-header, with the external predecessors changed to jump to it; and ii) an intermediate basic block to isolate the second latch from the header.

• Induction Variable analysis. An induction variable is a variable within a loop whose value changes by a fixed amount every iteration (i.e., the loop index) or is a linear function of another induction variable. Affine (linear) memory accesses are commonly used in loops (e.g., array accesses, recurrences). The index of the loop, if one exists, is often an induction variable, and the loop can contain more than one induction variable.

Figure 5.5: Natural, Simple and Transformed Loop (a natural loop with header, latches, and exits; its simplified form with a single pre-header and latch; and the transformed loop, in which the body is replaced by a dispatcher of asynchronous synthetic-method calls followed by a Sync block)

The induction variable substitution is a transformation that rewrites any induction variable in the loop as a closed form (function) of its index. It starts by detecting the candidate induction variables, then it sorts them topologically and creates a closed symbolic form for each of them. Finally, it substitutes their occurrences with the corresponding symbolic form (a worked example is sketched below).
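The following worked example (in C++, with illustrative values) shows the effect of the substitution: the variable c below changes by a fixed amount per iteration, so it can be rewritten as a closed-form function of the canonical index i, removing the loop-carried dependency:

    #include <cassert>

    int main() {
        const int start = 10, n = 5;
        int before[n], after[n];

        // Original: c depends on the previous iteration (an induction
        // variable), so iterations cannot run independently.
        int c = start;
        for (int i = 0; i < n; i++) {
            c = c + 5;
            before[i] = c;
        }

        // After substitution: c is a closed form of the index i, so any
        // iteration can be executed by a parallel job in isolation.
        for (int i = 0; i < n; i++) {
            int ci = start + (i + 1) * 5;
            after[i] = ci;
        }

        for (int i = 0; i < n; i++) assert(before[i] == after[i]);
        return 0;
    }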

As a part of our transformation, a loop is simplified and its induction variable (i.e., the index) is transformed into its canonical form, where it starts from zero and is incremented by one. A simple loop with multiple induction variables is a very good candidate for parallelization. However, induction variables introduce dependencies between iterations, which are not desirable when maximizing parallelism. To solve this problem, the value of each such induction variable is calculated as a function of the loop index prior to executing the loop body, and it is sent to the synthetic method as a runtime parameter. This approach avoids unnecessary conflicts on the induction variables. Next, we extract the body of the loop as a synthetic method. The return value of the method is a numeric value representing the exit that should be used. The addresses of all variables accessed within the loop body are passed as parameters to the method. The loop body is replaced by two basic blocks: Dispatcher and Sync. In the Dispatcher, we prepare the arguments for the synthetic method, calculate the value of the loop index, and invoke an API of our library, named lerna dispatch, providing it with the address of the synthetic method and the list of the just-computed arguments.

(a) Symmetric Transactions:

    Tx1:              Tx2:
      Y[1] := 0         Y[2] := 0
      Z[1] := X[1]      Z[2] := X[2]

(b) Normal Transactions:

    Tx1:              Tx2:
      Y[1] := 0         Z[1] := X[1]
      X[1] := Y[1]

Figure 5.6: Symmetric vs Normal Transactions

Each call to lerna dispatch adds a job to our internal jobs queue, but it does not start the actual execution of the job. The Dispatcher keeps dispatching jobs until our API decides to stop; when that happens, the control passes to the Sync block. Sync immediately blocks the main thread and waits for the completion of the current jobs. Figure 5.5 (c) shows the control flow graph (CFG) for the loop before and after the transformation. Regarding the exit of a job, we define two types of exits: normal exits and breaks. A normal exit occurs when a job reaches the loop latch at the end of its execution. In this case, the execution should go to the header, and the next job should be dispatched. If there are no more dispatched jobs to execute and the last one returned a normal exit, then the Dispatcher will invoke more jobs. On the other hand, when the job exit is a break, the execution needs to leave the loop body, and hence all later jobs are ignored. For example, assume a loop with N iterations: if the Dispatcher invokes B jobs before moving to the Sync, then ⌈N/B⌉ is the maximum number of transitions that can happen between Dispatcher and Sync. Summarizing, the Builder Pass turns the execution into a job-driven model, which can exploit parallelism. This strategy abstracts the processing from the way the code is written.
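The following sketch outlines the Dispatcher/Sync interplay described above. Only the name lerna_dispatch comes from our API; the queue, the batch size, and the serial execution inside the sync stub are illustrative simplifications (the real runtime executes the queued jobs as ordered transactions on worker threads):

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <queue>

    enum Exit { NORMAL_EXIT, BREAK_EXIT };
    using Job = std::function<Exit()>;

    static std::queue<Job> jobs_queue;     // internal jobs queue
    const std::size_t BATCH = 4;           // illustrative batch size

    bool lerna_dispatch(Job job) {         // enqueue only; no execution yet
        jobs_queue.push(std::move(job));
        return jobs_queue.size() < BATCH;  // false => control goes to Sync
    }

    Exit lerna_sync() {                    // stub: drain the batch serially
        while (!jobs_queue.empty()) {
            Exit e = jobs_queue.front()(); jobs_queue.pop();
            if (e == BREAK_EXIT) return BREAK_EXIT;  // ignore later jobs
        }
        return NORMAL_EXIT;
    }

    int main() {
        int64_t i = 0, n = 10;
        for (;;) {
            while (i < n) {                // Dispatcher block
                int64_t idx = i++;
                if (!lerna_dispatch([idx] {
                        std::cout << "iteration " << idx << "\n";
                        return NORMAL_EXIT;     // the synthetic loop body
                    }))
                    break;                 // batch full: move to Sync
            }
            // Sync block: wait for the batch, then continue or leave.
            if (lerna_sync() == BREAK_EXIT || i >= n) break;
        }
    }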

5.5.3 Transactifier Pass

After turning the bytecode into executable jobs, we employ additional passes to encapsulate jobs into transactions. Each synthetic method is demarcated by tx begin and tx end, and any memory operation (i.e., loads, stores, or allocations) within the synthetic method is replaced by the corresponding transactional handler. It is quite common that memory reads are numerous (and outnumber writes), so it is highly beneficial to minimize those performed transactionally. That is because the read-set maintenance, and the validation performed at commit time to preserve the correctness of the transaction (which iterates over the read-set), are the primary sources of TM's overhead. Several hardware prototypes have been proposed to enhance TM read-set management [38, 166]. Alternatively, in our work the transactifier pass eliminates unnecessary transactional reads, thus significantly improving the performance of the transaction execution for the following reasons:

• a direct memory read can be as much as three times faster than a transactional read [38, 166]. In fact, reading an address transactionally requires: 1) checking whether the address has already been written (i.e., checking the write-set); 2) adding the address to the read-set; and 3) returning the address value to the caller.

• the size of the read-set is limited, thus extending it requires copying entries into a larger read-set, which is costly. Keeping the read-set small reduces the number of resize operations.

• read-set validation is mandatory during the commit. The smaller the read-set, the faster the commit operation.

In our model, concurrent transactions can be described as "symmetric", meaning that the code executed by all active transactions is the same. That is because each transaction executes one or more iterations of the same loop; Figure 5.1 shows an example that produces symmetric transactions. We take advantage of this characteristic by reducing the number of transactional calls as follows. Clearly, local addresses defined within the scope of the loop do not need to be accessed transactionally. On the other hand, global addresses allow iterations to share information, and thus they need to be accessed transactionally. We perform a global alias analysis as a part of our transactifier pass to exclude some of the loads of shared addresses from the instrumentation process. To reduce the number of transactional reads, we apply the global alias analysis between all loads and stores in the transaction body: a load operation that can never alias with any store operation does not need to be read transactionally. For example, when a memory address is always loaded and never written in any path of the symmetric transaction code (Figure 5.6a), the load does not need to be performed transactionally. In Figure 5.1, concurrent transactions execute symmetric iterations of the loop: although j is read within the transactions, we do not have to read it transactionally, as it can never be changed by any of the transactions produced from the same loop (the only ones allowed to run concurrently). Note that this technique is specific to parallelizing loops; it cannot be applied to normal transaction processing, where transactions do not necessarily execute the same code as symmetric transactions do. In contrast, Figure 5.6b shows an example of two concurrent non-symmetric transactions, which execute different transaction bodies. Also in this case each transaction has a load operation that does not alias with other stores, but when the two transactions (with different code paths) run concurrently they can produce a wrong result. This is because these transactions are not symmetric, and thus the read must be done transactionally.

Transactions may contain calls to other functions. As these functions may manipulate memory locations, they must be handled. Whenever possible, we try to inline the called functions; otherwise, we create a transactional version of any function called within a transaction and, instead of calling the original function, we call its transactional version. Inlined functions are preferable because they permit the detection of dependencies between variables, which can be leveraged to reduce transactional calls, and the detection of dependent loop iterations, which is useful for excluding them from the parallelization. Finally, to avoid unnecessary overhead in the presence of single-threaded computation, or when a single job is executed at a time, we create another, non-transactional version of the synthetic method. This way we provide a fast version of the code without unnecessary transactional accesses.

5.6 Transactional Execution

Transactional memory algorithms differ in their memory versioning technique, i.e., undo-log or write-buffer, and in their conflict detection, i.e., lazy or eager. In write-buffer algorithms, transactional loads are recorded in a read-set and stores are sandboxed: each store is not written to the original address but is kept in a local storage, called the write-set, until commit. When a load is called inside a transaction, the write-set is first checked to determine whether this location has been written before and, if so, the value from the write-set is returned; otherwise, the value is read from memory and the address is added to the read-set. At commit time, the read-set is validated to make sure that it is still consistent. If the validation shows no conflict, then the write-set is written back to the shared memory and the changes become visible to all. On the other hand, undo-log algorithms expose memory changes to the main memory right after each transactional write, and they keep the old values in a local log to be restored upon transaction abort. Contention between transactions occurs when two transactions access the same address and at least one of them is a writer. Eager contention detection is done by associating a lock with each memory address (a lock can cover multiple addresses): a transaction acquires locks on its written addresses at encounter time, and conflicts are detected when a transaction tries to access a locked address. Alternatively, in lazy contention detection, each address has a version record: a transaction stores the versions of the addresses it reads, successful transactions update the versions of their written addresses at commit time, and a transaction is aborted when it finds a version number different from the one it recorded. Validation of the memory addresses accessed transactionally is done either by comparing addresses [59] or by checking whether the values have changed [53]. Table 5.1 shows the different design choices of some known transactional memory implementations.
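A minimal write-buffer sketch of the load/store logic just described (illustrative only; real implementations add conflict detection, locking, and validation):

    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    struct Tx {
        std::unordered_map<uint64_t*, uint64_t> write_set;  // sandboxed stores
        std::unordered_set<uint64_t*> read_set;             // logged loads

        void tx_write(uint64_t* addr, uint64_t val) {
            write_set[addr] = val;          // hidden until commit
        }
        uint64_t tx_read(uint64_t* addr) {
            auto it = write_set.find(addr); // read-after-write within the tx
            if (it != write_set.end()) return it->second;
            read_set.insert(addr);          // validated again at commit time
            return *addr;
        }
        void write_back() {                 // called only after validation
            for (auto& [addr, val] : write_set) *addr = val;
        }
    };

    int main() {
        uint64_t x = 1;
        Tx tx;
        tx.tx_write(&x, 5);
        uint64_t v = tx.tx_read(&x);  // 5, served from the write-set
        tx.write_back();              // x becomes 5
        return (v == 5 && x == 5) ? 0 : 1;
    }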

                          Version: Eager                Version: Lazy
    Contention: Eager     TinySTM [69], LogTM [129],    LTM [10], TinySTM [69],
                          UTM [10], McRT-STM [165]      SwissTM [63]
    Contention: Lazy                                    TL2 [59], TCC [81], NOrec [53]

Table 5.1: Transactional Memory Design Choices

Our design is decoupled from the specific TM implementation used, but it requires in-order commit. To allow the integration of further TMs, we identified the following requirements needed to support ordering.

Supervised Commit. Threads are not allowed to commit on their own as soon as they complete their execution. Instead, there must be a single stakeholder at a time that accomplishes transaction commits, namely the committer. It is not necessary to have a dedicated committer, because worker threads can take over this role according to their age: for example, the thread executing the transaction with the lowest age could be the committer and thus be allowed to commit. While a thread is committing, other threads can proceed by executing the next transactions speculatively, or wait until the commit completes. Allowing threads to proceed with their execution is risky, because it can increase the contention probability given that the lifetime of an uncommitted transaction grows (e.g., the holding time of its locks increases, or its timestamp validity decreases); therefore this speculation must be limited by a certain (tunable) threshold. Upon a successful commit, the committer role is delegated to the subsequent thread with the lowest age (which may be the same thread in some cases). This strategy allows only one thread at a time to commit its transaction(s). An alternative approach is to use a single committer (as in [120]) to monitor the completed transactions and to permit non-conflicting threads to commit in parallel by inspecting their read- and write-sets. Although this strategy allows for concurrent commits, its performance is bounded by the committer execution time.

Age-based Contention Management (CM). Algorithms with eager conflict detection (i.e., at encounter time) should favor transactions with lower age (i.e., those that encapsulate older iterations), while algorithms that use lazy conflict detection (i.e., at commit time) should employ an aggressive CM that favors the transaction that is committing through the single committer. Note that, for value-based validation with eager versioning, it is possible for an earlier transaction to wrongly commit after it has read from a speculative iteration. To solve this, during the commit phase, the read-set must be compared with the write-sets of completed transactions.

Memory Versioning. For eager versioning, if the implementation uses eager CM, then speculative iterations are prevented from affecting older iterations because they will collide. On the other hand, lazy CM may cause earlier iterations to read from speculative iterations; however, thanks to the single committer, the earlier iteration will detect the conflict and both transactions will be aborted. Lazy versioning implementations hide their changes from other transactions, thus no modification is required for this category.

(a) Conditional increments:

    for (int i = 0; i < 100; i++) {
        ...
        if (some_condition)
            counter++;
        ...
    }

(b) No induction variable:

    while (proceed) {
        ...
        counter++;
        ...
    }

Figure 5.7: Conditional Counters

5.6.1 High-priority Transactions

A transaction performs a read-set validation at commit time to preserve correctness; this is needed to ensure that its read-set has not been overwritten by any other committed transaction. Let Txn be a transaction that has just started its execution, and let Txn-1 be its immediate predecessor (i.e., Txn-1 and Txn process consecutive iterations of a loop). If Txn-1 has committed before Txn performs its first transactional read, then we can avoid the read-set validation of Txn when it commits: Txn is now the highest-priority transaction at this time, so no other transaction can commit its changes to the memory. We do that by flagging Txn as an irrevocable transaction. Similarly, a transaction is flagged as high-priority if: i) it is the first, and thus has no predecessor; ii) it is a retried transaction of the single committer thread; or iii) it belongs to a sequence of transactions with consecutive ages running on the same thread. This optimization reduces the commit time to just writing the write-set values to the memory.
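A sketch of this flagging rule (illustrative names and stubs):

    #include <atomic>

    std::atomic<long> last_committed_age{-1};  // age of the last commit

    struct Tx {
        long age;
        bool high_priority = false;

        void on_first_read() {
            // The immediate predecessor has already committed, so no older
            // writer can interleave: skip read-set validation at commit.
            if (last_committed_age.load() == age - 1) high_priority = true;
        }
        bool validate_read_set() { return true; }  // stub: value/version check
        void write_back() {}                       // stub: expose the writes
        bool commit() {
            if (!high_priority && !validate_read_set())
                return false;                      // abort and retry
            write_back();                          // just write the write-set
            last_committed_age.store(age);
            return true;
        }
    };

    int main() {
        Tx tx{0};            // the first transaction has no predecessor
        tx.on_first_read();  // -1 == age - 1, so it becomes irrevocable
        return tx.commit() ? 0 : 1;
    }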

5.6.2 Transactional Increment

Figure 5.7 illustrates a common situation: the counter. Loops with counters hamper parallelism because they create data dependencies between iterations, even non-consecutive ones, which produces a large number of conflicts, and induction variable substitution cannot produce a closed form (function) of the loop index (if one exists). If a variable is incremented (or decremented) based on an arbitrary condition, and its value is used only after the loop completes its whole execution, then it is eligible for the Transactional Increment optimization. In addition to the classical transactional read and write (tx read and tx write), we propose a new transactional primitive, the transactional increment, to enable the parallelization of loops with irreducible counters. This type of counter can be detected during our transformations. Within the transactional code, a store St is eligible for our optimization if it aliases with only one load Ld, and it writes a value that is based on the return value of Ld. The load, change, and store operations are replaced with a call to tx increment, which receives the address and the value to increment. We propose two ways to implement tx increment:

• Using an atomic increment on the variable, and storing the address in the transaction's meta-data. The atomic operation preserves data consistency; however, it affects the shared memory before the transaction commits. To address this issue, aborted transactions compensate all accessed counters by performing the same increment but with the inverse value.

• By storing the increments into thread-specific meta-data. At the end of each Sync operation, threads coordinate with each other to expose the aggregated per-thread increments of the counter. This method is appropriate for floating point variables, which cannot be updated atomically on commodity hardware.

Using this approach, transactions will not conflict on this address, and the correct value of the counter will be in memory after the completion of the loop (See Figure 5.1c).
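A minimal sketch of the first variant (an atomic increment plus compensation on abort; names are illustrative):

    #include <atomic>
    #include <utility>
    #include <vector>

    struct Tx {
        // Log of (counter, delta) pairs applied by this transaction.
        std::vector<std::pair<std::atomic<long>*, long>> increments;

        void tx_increment(std::atomic<long>* counter, long delta) {
            counter->fetch_add(delta);           // visible before commit
            increments.push_back({counter, delta});
        }
        void on_abort() {
            // Compensate: apply the same increments with inverse values.
            for (auto& [counter, delta] : increments)
                counter->fetch_sub(delta);
            increments.clear();
        }
    };

    int main() {
        std::atomic<long> k{0};
        Tx tx;
        tx.tx_increment(&k, 5);   // no transactional conflict on k
        tx.on_abort();            // aborted: k is restored to 0
        return (int) k.load();
    }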

5.7 Algorithms

Lerna currently integrates four TM implementations with different designs: NOrec [53], which executes commit phases serially without requiring any ownership record; TL2 [59], which allows parallel commit phases, but at the cost of maintaining an external data structure for storing the meta-data associated with transactional objects; UndoLog [4] with visible readers, which uses encounter-time versioning and locking for accessed objects and maintains a list of accessor transactions; and STMLite [120], which replaces the need for locking objects and maintaining a read-set with the use of signatures. STMLite is the only TM library designed specifically for supporting loop parallelization.

5.7.1 Ordered NOrec

In the current implementation, we use NOrec [53] as the default TM library. It is an algorithm that offers low memory access overhead with a constant amount of global meta-data. Unlike most STM algorithms, NOrec does not associate ownership records (e.g., locks or version numbers) with accessed addresses; instead, it employs a value-based validation technique during commit. A characteristic of this algorithm is that it permits a single committing writer at a time, which is in general not desirable, but it matches the needs of Lerna's concurrency control: having a single committer (see Section 5.6). For this reason, we decided to rely on NOrec as the default STM implementation for Lerna, both because of its small memory footprint and because it matches our ordering conditions (the single-committer limitation of NOrec is, in fact, one of our requirements for ordering transactions). Our modified version of NOrec manages which transaction should be the single committer according to the chronological order (i.e., the age).

5.7.2 Ordered TinySTM

The TinySTM algorithm uses encounter-time locking (ETL) and comes with two memory access strategies: write-through and write-back. TinySTM uses timestamps for transactions to ensure a consistent view of memory; we exploit this timestamp to represent the age of a transaction. Using an aggressive age-based contention manager, transactions can be ordered according to the chronological order of the code they execute. With TinySTM, transactions conflict at access time, which allows the early detection of conflicting iterations. The choice between the write-through and write-back strategies is workload-specific: at low contention, write-through provides a fast execution path, while write-back is more suitable for high contention, as it exhibits a lower-overhead abort procedure.

5.7.3 Other Algorithms

For completeness, we included ordered versions of the TL2 and UndoLog algorithms. The two algorithms follow different design choices: write-back and undo-log, respectively. However, the integration of more TM algorithms using the techniques explained in Section 5.6 is straightforward.

5.8 Adaptive Runtime

The Adaptive Optimization System (AOS) [12] is a general virtual machine architecture that allows online feedback-directed optimizations. In Lerna, we apply the AOS to optimize the runtime environment by tuning some important parameters (e.g., the batch size, the number of worker threads) and by dynamically refining sections of code already parallelized statically according to the characteristics of the actual application execution. Before presenting the optimizations made at runtime, we detail the component responsible for executing jobs (i.e., the Workers Manager in Figure 5.8). Jobs are evenly distributed over workers. Each worker thread keeps a local queue of its slice of dispatched jobs and a circular buffer of transaction descriptors. A worker is in charge of executing transactions and keeping them in the completed state once they finish.

Figure 5.8: Workers Manager (each worker holds a local queue, a buffer of transaction descriptors, and a state flag).

As stated before, after the completion of a transaction, the worker can speculatively begin the next transaction. However, to avoid unmanaged behaviors, the number of speculative jobs is limited by the size of the worker's circular buffer. The buffer size is crucial as it controls the lifetime of transactions: a larger buffer allows the worker to execute more transactions, but it also increases the transaction lifetime and, consequently, the conflict probability.

The ordering is managed by a worker-local flag called the state flag. This flag is read by the current worker, but is modified by its predecessor worker. Initially, only the first worker (executing the first job) has its state flag set, while the others have their flags cleared. After completing the execution of each job, the worker checks its local state flag to determine whether it is permitted to commit or must proceed to the next transaction. If there are no more jobs to execute, or the transactions buffer is full, the worker spins on its state flag. Upon a successful commit, the worker resets its flag and notifies its successor to commit its completed transactions. Finally, if one of the jobs hits a break condition (i.e., not the normal exit), the workers manager stops the other workers by setting their flags to a special value. This approach maximizes cache locality, as threads operate on their own transactions and access thread-local data structures, which also reduces bus contention.
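The following C++ sketch condenses the worker loop just described (structure and names are ours, not Lerna's actual code); the state flag is written by the predecessor and read locally, and the bounded buffer caps the speculation depth:

    #include <atomic>
    #include <deque>

    struct Worker {
        std::deque<int> jobs;        // local slice of dispatched jobs
        std::atomic<int> state{0};   // 0: wait, 1: may commit, 2: stop
        Worker* successor = nullptr;
        int completed = 0;           // executed but not yet committed
        int buffer_cap = 8;          // capacity of the descriptor buffer
    };

    void run(Worker& w) {
        while (!w.jobs.empty() || w.completed > 0) {
            if (w.state.load() == 2) return;          // break condition signaled
            if (!w.jobs.empty() && w.completed < w.buffer_cap) {
                // execute the next job as a transaction, keep it "completed"
                w.jobs.pop_front();
                ++w.completed;
            }
            if (w.state.load() == 1) {                // our turn in the order
                // commit all completed transactions here
                w.completed = 0;
                w.state.store(0);                     // reset our flag
                if (w.successor) w.successor->state.store(1); // notify successor
            }
            // otherwise: buffer full or no jobs left, so spin on the state flag
        }
    }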

5.8.1 Batch Size

The static analysis does not always provide information about the number of iterations; hence, we cannot accurately determine the best size for batching jobs. A large batch size may cause many aborts due to unreachable jobs, while small batches increase the number of interactions between the dispatcher and the executor and, as a consequence, the number of pauses due to Sync. In our implementation, we use an exponentially increasing batch size. Initially, we dispatch a single job, which covers the common case of loops with zero iterations; if the loop runs longer, then we increase the number of dispatched jobs exponentially until reaching a pre-configured threshold. Once a loop is entirely executed, we record the last batch size used so that, if the execution calls the same loop again, we do not need to perform the initial tuning again.
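A sketch of this policy (constants and names are ours): start with one job, double on every dispatch up to a threshold, and reuse the size recorded for this loop on later invocations.

    #include <algorithm>

    struct LoopStats { unsigned last_batch = 0; }; // from the knowledge base

    unsigned next_batch(const LoopStats& s, unsigned prev, unsigned max_batch) {
        if (s.last_batch) return s.last_batch; // tuned in a previous execution
        return prev == 0 ? 1                   // first dispatch: a single job
                         : std::min(prev * 2, max_batch);
    }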

5.8.2 Jobs Tiling and Partitioning

As explained in Section 5.5.2, the transformed program dispatches iterations as jobs, and our runtime runs jobs as transactions. Here we discuss an optimization, named jobs tiling, that allows the association of multiple jobs with a single transaction. Increasing the number of jobs per transaction reduces the total number of commit operations. It also assigns enough computation to each thread to outweigh the cost of the transactional setup. Nevertheless, tiling is a double-edged sword: increasing tiles increases the size of read and write sets, which can degrade performance. We tune tiling by taking into account the number of instructions per job and the commit rate of past executions stored in the knowledge base.

In contrast to tiling, a job may perform a considerable amount of non-transactional work. In this case, enclosing the whole job within the transaction boundaries makes the abort operation very costly. Instead, the transactifier pass checks the basic blocks with transactional operations and finds the nearest common dominator basic block for all of them. Given that, the transaction start (tx_begin) is moved to the common dominator block, and tx_end is placed at each exit basic block that is dominated by the common dominator. As a result, the job is partitioned into non-transactional work, which is now moved out of the transaction scope, and the transaction itself (see Figure 5.1c).
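A possible shape for the tiling heuristic is sketched below; the target of roughly constant transactional work per transaction and the halving under low commit rates are our assumptions, not Lerna's documented policy.

    // Pick how many jobs to tile into one transaction, given the measured
    // job length and the commit rate recorded in the knowledge base.
    unsigned pick_tile(unsigned instr_per_job, double commit_rate,
                       unsigned target_instr_per_tx) {
        unsigned tiles = target_instr_per_tx / (instr_per_job ? instr_per_job : 1);
        if (commit_rate < 0.5) tiles /= 2; // aborts are costly: shrink the tile
        return tiles ? tiles : 1;
    }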

5.8.3 Workers Selection

Figure 5.3 shows how the workers manager module handles the concurrent executions. The number of worker threads in the pool is not fixed during the execution, and it can be changed by the executor module. The number of workers directly affects the transactional conflict probability: the smaller the number of concurrent workers, the lower the conflict probability. On the other hand, increasing the number of workers can increase the overall parallelism (thus performance) and the utilization of the underlying hardware. In practice, at the end of the execution of a batch of jobs, we calculate the throughput and record it in the knowledge base, along with the commit rate, the tiles, and the number of workers involved. We apply a greedy strategy to find an effective number of workers by matching against the obtained throughput. The algorithm constructs a window of different worker counts, and iteratively improves it by changing the count of workers.

Using the throughput metric is better than relying on the commit rate. For example, three workers with an average commit rate of 50% are better than two workers with a 70% average commit rate. In addition, parallel execution is subject to other factors such as bus contention, cache hits, and threading overhead. The throughput combines all of these factors, thus giving a global picture of the performance gain. Finally, in some situations (e.g., high contention or very small transactions) it is better to use a single worker (i.e., run sequentially). For that reason, if our heuristic decides to use only one worker, then we use the non-transactional version of the synthetic method (as a fast path) to avoid the unnecessary transactional overhead.
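The greedy search can be condensed as follows (a sketch under our naming): record the throughput of each tried configuration and move the worker count one step toward the best configuration seen so far.

    #include <algorithm>
    #include <map>

    std::map<int, double> seen; // worker count -> measured throughput

    int next_workers(int current, double measured, int max_workers) {
        seen[current] = measured;
        int best = std::max_element(seen.begin(), seen.end(),
                       [](auto& a, auto& b) { return a.second < b.second; })->first;
        if (best > current) return std::min(current + 1, max_workers);
        if (best < current) return std::max(current - 1, 1);
        return current;         // already at the best known count
    }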

5.8.4 Manual Tuning

As stated earlier, Lerna is completely automated, but it still allows the programmer to provide hints about the program to parallelize. In this section we present some of the manual configurations that can be done. These configurations are only applicable if the source code is available.

A safe call is a call to an external function defined in a library or a system call; such a call must be stateless and cannot affect the program result if repeated multiple times. Such calls cannot be detected using static analysis, so we rely on the user to define a list of them. Our framework generates a histogram of calls that represents the number of blocks excluded from our transformation because of each call. Based on this histogram, the user can decide which calls are most beneficial to classify as safe calls.

The alias analysis techniques (see Section 5.5.3) help in detecting dependencies between loads and stores; however, in some situations (as documented in [47]) they produce conservative decisions, which limit the opportunities for parallelization. It is non-trivial for the static analysis to detect aliases across nested calls. To assist the alias analysis, we try to inline the called functions within the transactional context. Nevertheless, it is common in many programs to find a function that performs only loads of immutable variables (e.g., reading memory input). Marking such a function as read-only can significantly reduce the number of transactional reads, as we will be able to use the non-transactional version of the function, hence reducing the overall overhead.

Section 5.4 explains how Lerna detects the code eligible for parallelization through the profiling phase. Alternatively, users can directly inform our Builder Pass of their recommendations for applying our analysis. Also, the programmer can exclude some sections of the code from being parallelized. We support a user-defined exclude list for portions of code that will be excluded from any transformation.

Processor      2× Opteron 6168 processors
CPU Cores      12
Clock Speed    1.9 GHz
L1 Cache       128 KB
L2 Cache       512 KB
L3 Cache       12 MB
Memory         12 GB
OS             Ubuntu 10.04, Linux

Table 5.2: Testbed and platform parameters for Lerna experiments.

5.9 Evaluation

In this section we evaluate Lerna and measure the effect of the key performance parameters (e.g., job size, worker count, tiling) on the overall performance. Our evaluation involves a total of 13 applications grouped into micro-benchmarks and macro-benchmarks: STAMP [37] and PARSEC [133]. The micro-benchmarks allow us to tune the application workload in order to show the strengths (and weaknesses) of our automated solution. The applications of the macro-benchmarks show the impact of Lerna on well-known workloads. We compare the speedup of Lerna over the (original) sequential code and against the manual, optimized transactional version of the code (not available for the PARSEC benchmarks). Note that the latter is not coded by us; it is released along with the benchmark itself, and it is made manually with knowledge of the details of the application logic, thus it can leverage optimizations, such as out-of-order commit, that cannot be caught by Lerna automatically. As a result, Lerna's performance goal is twofold: providing a substantial speedup over the sequential code, and being as close as possible to the manual transactional version.

The testbed used in the evaluation consists of an AMD multicore machine equipped with 2 Opteron 6168 processors, each with 12 cores running at 1.9 GHz. The total memory available is 12 GB, and the cache sizes are 128 KB for the L1, 512 KB for the L2, and 12 MB for the L3. This machine is representative of current commodity hardware. On this machine, the overall refactoring process, from profiling to the generation of the binary, took ∼10s for the micro-benchmarks and ∼40s for the macro-benchmarks. We did not use a machine with more cores to avoid the effects of Non-Uniform Memory Access (NUMA), where there is an additional latency depending on the core used and the memory socket accessed. Optimizing TM for NUMA access is studied in detail in [126].

5.9.1 Micro-benchmarks

In our first set of experiments we consider the RSTM micro-benchmarks [2] to evaluate the effect of different workload characteristics, such as the amount of transactional operations per job, the job length, and the read/write ratio, on the overall performance. We report the speedup over the sequential code while varying the number of threads used. The performance is measured for two versions of Lerna: one adaptive, where the most effective number of workers is selected at runtime (thus its performance does not depend on the number of threads reported on the x-axis), and one with a fixed number of workers. We also report the percentage of aborted transactions (right y-axis). As a general comment, we observe some small slowdown only when one thread is used; otherwise Lerna is usually very close to the manual transactional version. Our adaptive version gains on average 2.7× over the original code and is effective because it finds (or gets close to) the configuration where the top performance is reached.

Figure 5.9: ReadNWrite1 Benchmark (panels: (a) Read100Write1 speedup; (b) Read1000Write1 speedup; (c) Lerna speedup as a function of threads and number of reads).

In ReadNWrite1Bench (Figure 5.9), transactions read N locations and write 1 location.

Figure 5.10: ReadWriteN Benchmark (panels: (a) ReadWrite100 speedup; (b) ReadWrite100 aborts; (c) ReadWrite1000 speedup; (d) ReadWrite1000 aborts; (e) Lerna speedup as a function of threads and number of reads/writes).

Given that, the transaction write-set is very small, which implies a fast commit for a lazy TM algorithm such as ours. The abort rate is low (0% in the experiments), and the transaction length is proportional to the read-set size. Figure 5.9c illustrates how the size of the transaction read-set (with a small write-set) affects the speedup. Lerna performs close to the manual Tx version; however, when transactions become smaller, the ordering overhead slightly outweighs the benefit of more parallel threads.

In ReadWriteN (Figure 5.10), each transaction reads N locations, and then writes to another N locations. The large transaction write-set introduces a delay at commit time for lazy versioning TM algorithms, and increases the number of aborts. Both Lerna and the manual Tx incur performance degradation at high numbers of threads due to the high abort rate (up to 50%). In addition, for Lerna the commit phase of long transactions forces some (ready to commit) workers to wait for their predecessors, thus degrading the overall performance. In such scenarios, the adaptive worker selection helps Lerna avoid this degradation.

MCASBench performs a multi-word compare and swap, by reading and then writing N consecutive locations. Similarly to ReadWriteN, the write-set is large, but the abort probability is lower than before because each pair of read and write acts on the same location. Figure 5.11 illustrates the impact of increasing workers with long and short transactions. In Figure 5.11e, we see that the speedup degrades as the number of operations per transaction increases; besides thread contention, the number of aborts increases with the number of accessed locations (see Figures 5.11b and 5.11d). Interestingly, unlike the manual Tx, Lerna performs better at a single thread because it uses the fast-path (non-transactional) version of the jobs to avoid needless overhead. It is worth noting that none of the micro-benchmarks performs many calculations on the accessed memory locations, which represents a challenge for a TM-based approach.

Figure 5.12 summarizes the behavior of the adaptive selection of the number of workers for the three micro-benchmarks while varying the size of the batch. The procedure starts by trying different worker counts within a fixed window (here, 7), then it picks the best count according to the calculated throughput. Changing the worker count shifts the window, thus allowing the technique to learn more and find the most effective setting for the current execution.

Figure 5.11: MCAS Benchmark (panels: (a) MCAS100 speedup; (b) MCAS100 aborts; (c) MCAS300 speedup; (d) MCAS300 aborts; (e) Lerna speedup as a function of threads and MCAS operations).

Figure 5.12: Adaptive workers selection (number of worker threads per batch for ReadNWrite1, ReadWriteN, and MCAS).

5.9.2 The STAMP Benchmark

STAMP [37] is a comprehensive benchmark suite with eight applications that cover a variety of domains. Figures 5.13, 5.14 and 5.15 show the speedup of Lerna's transformed code over the sequential code, and against the manually transactified version of the application, which exploits unordered commit. While the main focus of our work is speedup over the sequential baseline, we highlight here the overheads and tradeoffs with respect to a handcrafted manual transformation aware of the underlying program semantics. Two applications (Yada and Bayes) have been excluded because they expose non-deterministic behaviors, thus their evolution is unpredictable when executed transactionally. Table 5.3 provides a summary of the benchmarks and the inputs used in the evaluation.

Kmeans, a clustering algorithm, iterates over a set of points and associates them to clusters. The main computation is in finding the nearest point, while shared data updates occur at the end of each iteration. Using job partitioning, Lerna achieves 6× and 1.6× speedup over the sequential version (see Figures 5.13a-5.13d). The ordering introduces a 25% delay compared to the unordered transactional version.

Genome, a gene sequencing program, reconstructs the gene sequence from segments of a larger gene. It uses a shared hash-table to organize the segments, which requires synchronization over its accesses. In Figures 5.13e-5.13h, Lerna shows 3× to 5.5× speedup over sequential. Ordering semantics are not required for the hash-table insertions, which allows the manual Tx to perform 1.4× to 1.8× faster than Lerna, as its transactions commit as soon as they end.

Figure 5.13: Kmeans and Genome Benchmarks (panels: (a) Kmeans low contention speedup; (b) Kmeans low contention aborts; (c) Kmeans high contention speedup; (d) Kmeans high contention aborts; (e) Genome low contention speedup; (f) Genome low contention aborts; (g) Genome high contention speedup; (h) Genome high contention aborts).

Benchmark   Configurations                                   Description
Kmeans      Low:  -m60 -n60 -t0.00001 n65536 -d128 -c16      m: max clusters, n: min clusters
            High: -m20 -n20 -t0.00001 n65536 -d128 -c16
Genome      Low:  -g16384 -s64 -n86777216                    n: number of segments
            High: -g16384 -s64 -n16777216
Vacation    Low:  -n30 -q90 -u100 -r1048576 -t4194304        n: queries, q: relations queried ratio
            High: -n50 -q60 -u100 -r1048576 -t4194304
SSCA2       Low:  -s20 -i0.1 -u0.1 -l3 -p3                   s: problem scale, i: inter cliques ratio,
            High: -s19 -i0.5 -u0.5 -l3 -p3                   u: unidirectional ratio
Labyrinth   Low:  -x128 -y128 -z3 -n128                      x, y, z: maze dimensions, n: exits
            High: -x32 -y32 -z3 -n96
Intruder    Low:  -a10 -l128 -n262144 -s1                    l: max number of packets per stream
            High: -a10 -l12 -n262144 -s1

Table 5.3: Input configurations for STAMP benchmarks.

Vacation is a travel reservation system using an in-memory database. The workload consists of client reservations; this application emulates an OLTP workload. Lerna improves the performance, running up to 2.8× faster than the sequential version, and it is very close to the manual version (see Figures 5.14a-5.14d). Vacation transactions do not exhibit many aborts, as each transaction accesses a relatively large amount of data.

SSCA2 is a multi-graph kernel that is commonly used in domains such as biology and security. The core of the kernel uses a shared graph structure that is updated at each iteration. The transformed kernel outperforms the original by 1.4×, while dropping the in-order commit allows up to 3.9× (see Figures 5.14e-5.14h).

Lerna exhibits no speedup with Labyrinth and Intruder (see Figure 5.15) because they use an internal shared queue for storing the elements to process, and they access it at the beginning of each iteration to dispatch elements for execution (i.e., a single contention point). While our jobs execute as a single transaction, the manual transactional version creates multiple transactions per iteration: the first transaction handles just the queue synchronization, while the others do the processing. Job partitioning does not help in this situation because the shared access occurs at the beginning of the iteration. Even if each job were split into two transactions, the ordering between jobs would prevent the first transaction of higher iterations from committing. At best, this technique can parallelize two concurrent iterations: the first transaction of the higher-index iteration can access the shared queue right after the corresponding lower-index iteration commits.

Figure 5.14: Vacation and SSCA2 Benchmarks (panels: (a) Vacation low contention speedup; (b) Vacation low contention aborts; (c) Vacation high contention speedup; (d) Vacation high contention aborts; (e) SSCA2 low contention speedup; (f) SSCA2 low contention aborts; (g) SSCA2 high contention speedup; (h) SSCA2 high contention aborts).

Figure 5.15: Labyrinth and Intruder Benchmarks (panels: (a) Labyrinth low contention speedup; (b) Labyrinth low contention aborts; (c) Labyrinth high contention speedup; (d) Labyrinth high contention aborts; (e) Intruder low contention speedup; (f) Intruder low contention aborts; (g) Intruder high contention speedup; (h) Intruder high contention aborts).

Figure 5.16: Effect of Tiling on abort and speedup using 8 workers and Genome.

As explained in Section 5.8.2, selecting the number of jobs per transaction (jobs tiling) is crucial for performance. Figure 5.16 shows the speedup and the abort rate while changing the number of jobs per transaction from 1 to 10000 using the Genome benchmark. Although the abort rate decreases when reducing the number of jobs per transaction, this does not achieve the best speedup: the overhead of setting up transactions nullifies the gain of executing small jobs. For this reason, we dynamically set the job tiling according to the job size and the gathered throughput.

Manual tuning further assists Lerna in improving the code analysis and eliminating avoidable overhead. Evidence of this is reported in Figure 5.17, where we show the speedup of Kmeans against different numbers of worker threads using two variants of the code transformed by Lerna: the first is the normal automatic transformation, and the second leverages the user's hints about memory locations that can be accessed safely (i.e., non-transactionally). The figure shows that Lerna's transformed code then outperforms even the manual transactional code.

Figure 5.17: Kmeans performance with user intervention (panels: (a) low contention; (b) high contention).

Benchmark      Configurations           Description
Fluidanimate   -i in 500K               i: input file with 500K particles data
Swaptions      -ns 100000 -sm 1 -nt 1   ns: number of swaptions
Blackscholes   -i in 10M                i: input file with 10 million equations data
Ferret         top=50 depth=5           top: top K, depth: depth

Table 5.4: Input configurations for PARSEC benchmarks.

5.9.3 The PARSEC Benchmark

PARSEC [23] is a benchmark suite for shared-memory chip-multiprocessor architectures. We evaluate Lerna's performance using a subset of these benchmarks which covers different aspects of our implementation. Table 5.4 provides a summary of the benchmarks and the inputs used in the evaluation.

The Black-Scholes equation [100] is a differential equation that describes how, under a certain set of assumptions, the value of an option changes as the price of the underlying asset changes. This benchmark calculates the Black-Scholes equation for the input values and produces the results. The iterations are relatively short, which causes the transformed program to produce a large number of jobs in Lerna. However, jobs can be tiled (see Section 5.8.2) so that each group of iterations executes within a single job. Another approach is to add an unrolling pass before our transformation. Figure 5.18c shows the speedup with different values for loop unrolling.

The Swaptions benchmark contains routines to compute various security prices using the Heath-Jarrow-Morton (HJM) [86] framework. Swaptions employs Monte Carlo (MC) simulation to compute prices. Figure 5.18b shows Lerna's speedup over the sequential code.

Fluidanimate [133] is a well-known application performing physics simulations (of incompressible fluids) to animate arbitrary fluid motion using a particle-based approach. The main computation is spent on computing particle densities and forces, which involves six levels of loop nesting updating a shared array structure. Iterations update a global shared matrix of particles, which makes every concurrent transaction conflict with its preceding transactions (see Figure 5.18d).

Ferret is a toolkit used for content-based similarity search. The benchmark workload is a set of queries for image similarity search. Similar to Labyrinth and Intruder, Ferret uses a shared queue to process its queries, which represents a single contention point and prevents any speedup with Lerna (see Figure 5.18e).

Figure 5.18: PARSEC Benchmarks (panels: (a) Blackscholes; (b) Swaptions; (c) effect of loop unrolling on speedup using Blackscholes; (d) Fluidanimate; (e) Ferret).

Figure 5.19: Effect of changing the TM algorithm (y-axis in log-scale; speedup of Manual Unordered, Lerna NOrec, Lerna TL2, Lerna UndoLog, and Lerna STMLite across the STAMP and PARSEC benchmarks).

5.9.4 The Effect of Changing TM Algorithm

Figure 5.19 shows the speedup of Lerna's transformed code over the sequential code, and against the manual transactional version of the applications, which exploits unordered transaction commits. (In this experiment, we used different input configurations than the ones shown in Table 5.3 and Table 5.4, in order to illustrate the performance differences between the integrated TM algorithms.) In Kmeans, under high contention, NOrec is 3× slower than the manual unordered transactional version (more data conflicts and stalling overhead); however, they are very close in the low contention scenario. TL2 and STMLite suffer from false conflicts (given the limited lock table or signatures), which limits their scalability. Genome performs a large number of read-only transactions (the hash-table exists operation), a friendly behavior for all the implemented algorithms; TL2 is just 10% slower than the manual competitor. Similarly, Swaptions benefits from its small write-set, which produces a similar speedup with all the integrated TM algorithms.

With Vacation, Lerna using NOrec achieves a peak performance of 2.8× over sequential, and it is very close to the manual version. TL2's performance depends on the lock contention: 2.2× under low contention, and no speedup under high contention. The transformed SSCA2 kernel outperforms the original by 2.1× using NOrec, while dropping the in-order commit allows up to 4.4×. It is worth noting that NOrec is the only algorithm that manages to achieve a speedup here, because it tolerates high contention and is not affected by false sharing, as it deploys a value-based validation. In Black-Scholes, the iterations are relatively short; algorithms with low overhead, such as NOrec and UndoLog, outperform TL2 and STMLite. All the TM algorithms produce speedups between 2.1× and 5.6×. As discussed in Sections 5.9.2 and 5.9.3, regardless of the underlying TM algorithm, Lerna exhibits no speedup with Labyrinth, Intruder, Fluidanimate and Ferret.

5.10 Discussion

Lerna can be applied to applications beyond the benchmarks used here. In this section we discuss the lessons learned from our evaluation, in order to provide an intuition about the (negative) cases where Lerna's parallelization refactoring is less effective. Lerna extracts parallelism when possible. There are scenarios where, without the programmer handing knowledge of the application's logic to the refactoring process, Lerna encounters obstacles (e.g., a single point of contention) that it cannot break automatically due to the lack of “semantic” knowledge. Examples of that include complex data structure operations, as in Labyrinth, Intruder and Ferret, or data-level conflicts, as in Fluidanimate. We also looked into the SPEC [88] applications, and we found that most of them use data structure iterators.

The primary factors of overhead are: ordering transactions; contention on accessing shared data (e.g., the implicit constraint imposed by the underlying bus-based architecture); and aborting conflicting transactions because of true data or control dependencies, or false conflicts. In addition, Lerna becomes less effective when: there are loops with few iterations (e.g., Fluidanimate), because the actual parallelization degree of the application is limited; there is an irreducible global access at the beginning of each loop iteration, thus increasing the chance of invalidating most transactions from the very beginning (e.g., Labyrinth); or the workload is heavily unbalanced across iterations. In all the above cases, at worst, the code produced by Lerna performs as the original.

Chapter 6

Ordered Write Back Algorithm

Ordering transactions before the execution is a known problem, mostly relevant to deployments where an external service is in charge of providing the commit order to satisfy certain properties (e.g., equivalent semantics or system dependability). Examples of these deployments include (but are not limited to): loop parallelization [178, 73, 184, 160], and fault-tolerance using the state machine replication (SMR) approach [169, 115]. In the former, loops designed to run sequentially are parallelized by executing their iterations concurrently, guarded by TM transactions, as in [73, 160], to handle conflicts (i.e., data dependencies) correctly. In that case, providing an order matching the sequential one is fundamental to enforce a semantics (of the parallel code) that is equivalent to the original (sequential) code. Regarding the latter, SMR-based transactional systems order transactions (totally or partially) before their execution to guarantee that the state always evolves consistently on the several computing nodes. To achieve this, usually a consensus protocol is employed (e.g., Paxos [106]), which establishes a common order among transactions; this order must then be enforced while processing and committing those transactions. In this and the next chapters, we present two software TM implementations that outperform the baseline and specialized solutions that commit transactions in order (e.g., STMLite [120]): Ordered Write Back (OWB) and Ordered Undo Logging (OUL) [164]. They are based on two widely used techniques to merge transaction modifications into the shared state, namely write back (in OWB) and write through (in OUL). Both OWB and OUL deploy a common design that uses a dependency-aware cooperative model: transactions employ a weaker isolation level, and exchange both data and locks to increase concurrency while preserving the commit order. More specifically, OWB uses data forwarding for transactions that finish their execution successfully but are not committed yet, while OUL leverages encounter-time locking with the ability to pass lock ownership to other transactions. We also provide a variant of OUL (OUL-Steal) that deploys a lock-stealing technique. Given its design principle, which allows transactions to access modifications made by committing transactions that may abort later due to disorderly executions, OWB is safe as it
ensures TMS1 [62], a weaker consistency condition than opacity [76, 77], the most popular consistency condition for TM. TMS1 has been proved to be sufficient to guarantee safety in our model [13], as is the case with opacity. OUL achieves higher concurrency and therefore higher performance, at the cost of weakening the correctness level to Strict Serializability [21]. In this chapter, we present an analysis that identifies the major bottlenecks in processing transactions in parallel while guaranteeing a pre-defined commit order (Section 6.3). Next, we propose a methodology that effectively solves this problem by combining dependency awareness and lock forwarding with undo-log techniques, which overcomes competitors' drawbacks (Section 6.4). Finally, we present OWB, which is, to the best of our knowledge, the first TM implementation that satisfies TMS1 (released as an open-source project at https://bitbucket.org/mohamed-m-saad/ordertm).

6.1 Commit Order

In Chapters 4 and 5, we exploited Transactional Memory (TM) as a technique for protecting speculative code [93, 120] and extracting parallelism from sequential code [178, 73, 184, 160]. A conflict is handled by aborting (and re-executing) the transaction that ran the code with the later chronological ordering. The key idea is that code blocks run as transactions and commit in the program's chronological order. The techniques for supporting the aforementioned ordering are classified as blocking [73, 120, 185] or freezing [190] techniques. These techniques are described in detail in Section 6.3. Using blocking, the thread executing the transaction with the lowest age commits. While a thread is committing, the other threads wait until the commit completes. Upon a successful commit, the subsequent thread with the lowest age performs its commit. As an example of the blocking approach, Mehrara et al. [120] proposed a TM with a separate thread, the Transaction Commit Manager (TCM), that detects conflicts among transactions waiting for their turn to commit. The TCM orchestrates the in-order commit process with the ability to have concurrent commits; worker threads poll and stall to wait for the TCM's permission. Gonzalez et al. [73] use a distributed approach for handling the commit order: each thread employs a bounded circular buffer to store its completed transactions, and if all buffer slots are exhausted, the thread stalls until one of the pending executed transactions reaches the correct commit order. TCC [81] is a hardware model for supporting transaction ordering. The key idea is to employ local buffering, using the processor's cache, to keep transaction changes; a transaction then waits for its turn and flushes (commits) its buffer if it is still valid. Another existing solution lets threads freeze completed transactions and proceed to execute higher-age transactions, with the disadvantage of increasing the transaction lifetime (hence, a higher conflict probability). Zhang et al. [190] introduced this technique to support
a pre-determined total order of transactions; a next-to-commit shared variable is used to preserve this order. The level of atomicity is an orthogonal classification for the aforementioned techniques. The classical TM model mandates that transactions see only committed values. However, concurrent transactions can cooperate and construct a dependency graph of uncommitted, yet exposed, values; based on this graph, the transactions commit in the constructed order. Although this approach defines the inverse problem of the in-order commit, it is fruitful for us to exploit it. Ramadan et al. [143] proposed a cooperative approach for executing transactions using a Dependency-Aware STM model (DASTM). In DASTM, transactions forward their uncommitted changes, and based on that, the runtime defines the commit order. Interestingly, transactions never abort (they only stall their commit), except in the case of deadlock. Both the OWB and OUL algorithms employ data forwarding to permit read-after-write operations, and they do not suffer from the high overhead of maintaining the conflict graph. Deterministic execution of TM may be seen as a distantly related topic. Recently, an STM implementation that improves performance in the case of deterministic execution was proposed in [147]. Deterministic execution is meant to reduce the possible parallelism in the system, whereas our approaches aim at introducing parallelism when a specific commit order is enforced.

6.2 Execution and Memory Model

A transaction is a unit of work that executes atomically (i.e., all or nothing) and in isolation from concurrent work. Memory transactions run optimistically, and thus in parallel, even if they may access the same shared objects. However, when a conflict is detected, one of the transactions is aborted and restarted. Our model assumes a set of transactions T = {T1, T2, ..., TN}. Transactions access shared objects using read and write operations, with their usual meaning [93]. We denote the sets of shared objects accessed by transaction Tk for read and write as read-set(Tk) and write-set(Tk), respectively. A transaction execution is defined as a sequence of operations, where each operation is represented by a pair of invoke and return events. Besides the read and write operations, whose semantics is the usual one, it also includes a commit operation that starts by invoking the try-commit event, whose return value is either commit or abort. Note that a transaction can also be aborted before invoking the try-commit event. A transaction that began its execution and has not invoked the try-commit event yet is called live. A transaction that invoked the try-commit event but has not committed or aborted yet is called commit-pending. When a transaction is categorized as committed, it means that all its write operations have been executed permanently on the shared state; and when it is categorized as aborted, it means that its operations have no permanent effect. In both cases, all meta-data is cleaned before proceeding or re-executing.

Figure 6.1: States of a transaction execution in OWB and OUL (Begin, Read/Write, Try-commit, Commit; a transaction is Live, then Commit-pending, and, after validation and write-set publication, Exposed).

Figure 6.1 summarizes the transaction states. A shared object has a value and a (versioned) lock associated with it. We say that a shared object is exposed if it is locked by some live or commit-pending transaction. Intuitively, a shared object is exposed if some other transaction can already read it although its creator is still executing. We call a transaction that is commit-pending and has all its written objects exposed an exposed transaction. Two transactions are said to be conflicting on an object X if both are concurrent (i.e., not committed nor aborted yet) and access the object X, and at least one of them writes X. Note that two transactions are conflicting even if both write the same object without reading it. Tracking such a dependency is fundamental, as motivated in the next paragraph. A conflict is handled by aborting one of the transactions, or by postponing the access generating the conflict (if possible) until the other transaction commits.

6.2.1 Age-based Commit Order (ACO)

In this chapter, we focus on TM implementations providing a specific order of transaction commits, which is assumed to be known prior to the transaction execution. We denote such an order as the transaction age. The age of a transaction should not be confused with the transaction timestamp taken from a global clock, as used by many existing TM implementations (e.g., the Transactional Locking 2 Algorithm (TL2) [59], the Lazy Snapshot Algorithm (LSA) [150]). That timestamp is defined by the concurrency control according

to the actual state of the execution and not externally (e.g., by the application) as the age is. In fact, in those solutions the transaction timestamp is leveraged to efficiently capture the modifications which happened to the shared state after the transaction's starting time. An illustrative distinction between the age and such a timestamp is that when a transaction aborts, its timestamp is updated before retrying (as in TL2), whereas the age cannot be updated. The age is assumed to be defined by external factors before activating any transaction (e.g., by an ordering layer deployed on top of the TM implementation), and must match the transactions' commit order. Assigning ages to transactions establishes a total order on their commits. Let ≺ be the relation representing the total order on transaction ages, and let those ages be denoted as subscripts (e.g., Tx). An aborted transaction is restarted with the same value of age. A concurrency control that enforces an order of commits ensures that when two operations oi and oj, issued by transactions Ti and Tj, respectively, are conflicting, then oi must happen before oj if and only if Ti ≺ Tj (i.e., i precedes j) [148]. We deploy this idea in our execution model by introducing an Age-based Commit Order (ACO), also known as timestamp-based Commitment Ordering (TCO) in [148]. ACO mandates a customization of the classical TM model. As an example, the transaction conflict detection should guarantee that when Ti ≺ Tj, Ti must not read a value written by Tj. We define a transaction Tj as reachable if all lower-age transactions Ti, with i < j, are committed. This term depicts the fact that Tj has been reached by a serial execution where all transactions T1, ..., Ti committed in the order {1, ..., i, j}. Note that every transaction eventually commits in our model; this is due to the fact that as soon as a transaction aborts, it is restarted by the TM library.
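The bookkeeping implied by these definitions is small; a minimal C++ sketch (our naming) keeps the age in the transaction descriptor and derives reachability from the highest committed age:

    #include <atomic>

    // Highest age committed so far; ages are assigned externally, starting
    // at 1, and are preserved across retries of the same transaction.
    std::atomic<unsigned long> last_committed{0};

    struct Tx { unsigned long age; };

    bool is_reachable(const Tx& t) {
        return last_committed.load(std::memory_order_acquire) + 1 == t.age;
    }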

6.3 Analyzing the Ordering Overhead

In this section, we analyze the overhead introduced by deploying a concurrency control that enforces a specific, pre-defined commit order. Assume an ACO where transaction Ti precedes transaction Tj (Ti ≺ Tj). Although Tj may finish executing its operations before Ti, it must wait for Ti to commit first; only then can Tj start committing (and validating, if needed) its changes. This strategy forces the thread executing transactions with higher age to either block [73, 120, 185] or freeze [190]. Both strategies introduce an additional delay for the higher-age transactions. In the blocking approach, the thread spins waiting for the correct commit order [185], while the freeze approach allows threads to execute higher-ordered transactions and periodically check the commit-pending lower-age transactions. In both cases, the overall utilization is negatively affected: either by wasting processing cycles in waiting (stalling), or through re-executing the aborted transactions (as a result of the increased conflict probability). Also, in the freeze approach, the overall performance is limited by the sum of the time consumed in performing commits, as they cannot trivially run in parallel. This is why, in Figure 6.3, each

batch of commits is synchronized with the next batch of executions. In the following, we show analytically the stalling periods (which we name ord-delay and indicate as Σδ) experienced by transactions in these techniques given the ordering constraints, using the best-case scenario in which there are no aborts among concurrent transactions. For simplicity, we assume that all transactions execute equally long code, and that the total number of active threads equals the number of cores, to avoid the overhead of scheduling executions.

Let αk be the time taken by Tk to execute all its operations when there are no conflicts. Let βk be the time taken by Tk to commit its changes to the shared memory and to notify its successor transaction about the completion of the commit. Let δk be the idle time of the thread executing Tk before starting any useful work, such as committing Tk or executing Tk+n.

6.3.1 Blocking/Stall Approach

Assume Tk is the last committed transaction, and transactions Tk+1, ..., Tk+n are running in parallel on n processors. Given two transactions Tk+i and Tk+j, where i < j ≤ n, if Tk+j completes its execution before Tk+i finalizes its commit, then Tk+j has to wait for δk+j time. The following expression shows the total wait time for N transactions distributed over n processors, and therefore activated at most n at a time in parallel.

Σδ = (N − n) · ((n − 1) · β − α) + β · n · (n − 1) / 2

The first n transactions (i.e., T1, ..., Tn) start at the same time but they commit in order, causing the delay captured by the second term of the equation. All other transactions (i.e., Tn+1, ..., TN), although they have to wait for all their predecessors to commit, may overlap their execution period with the commits of their predecessors; this (possible) delay is included in the first term of the formula. Figures 6.2a and 6.2b show the execution of ordered transactions running over 4 and 8 threads, respectively, where β = 0.25α. Assume the transactions are distributed evenly over the worker threads in round-robin, and the rightmost thread executes the lowest-age transactions. With 4 threads, the stall occurs only at the beginning (the second term of the above equation), while 8 threads suffer from a continuous delay along the execution. Using the above formula, we can see that with up to 5 threads there is no continuous delay during the execution. The total ord-delay exhibits quadratic growth with an increasing number of threads, and it is affected by the ratio between the commit time (i.e., β) and the execution time (i.e., α).
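The 5-thread bound follows directly from the first term of the formula. With β = 0.25α, the continuous (per-transaction) delay is positive only when

    (N − n) · ((n − 1) · β − α) > 0  ⟺  (n − 1) · β > α  ⟺  n > 1 + α/β = 1 + 4 = 5,

so for n ≤ 5 only the startup term β · n · (n − 1) / 2 remains.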

Figure 6.2: The Execution of Ordered Transactions using the Blocking/Stall Approach (panels: (a) 4 threads; (b) 8 threads).

6.3.2 Freeze/Hold Approach

In contrast to the blocking approach, in the freeze approach a thread transitions finished transactions into a freeze (hold) state [190], and executes their commit phases later, when all the preceding transactions (i.e., those with lower ages) become reachable (see Figure 6.3). Without loss of generality, we assume that a designated thread performs all the "frozen" commit phases. Similarly to the blocking approach, we observe that threads stall to enable the committer thread. The following expression shows the ord-delay in the case of N transactions.

Σδ = α + max(0, N · β − α · ⌈(N − n)/n⌉)

Note that, similarly to the blocking approach, the total delay is affected by the ratio between the commit time (β) and the execution time (α). However, the delay is relatively smaller than in the blocking approach, while the number of live transactions is doubled, increasing the probability of experiencing a conflict given the extended transaction lifetime (locks on accessed objects are held for longer).

Figure 6.3: The Execution of Ordered Transactions using the Freeze/Hold Approach.

Analyzing the results illustrated in this section, we can conclude that ordering transaction commits negatively affects the overall resource utilization and may nullify any potential gain from thread parallelism. In the following sections, we detail our proposals to overcome the above drawbacks; they introduce a new execution model and algorithms that reduce the ordering delay.

6.4 General Design

In this section, we present our cooperative model for supporting ACO. The most important factor affecting the transaction ord-delay, as analyzed in the previous section, is the commit latency: the time required to execute the commit phase sets a lower bound on the ord-delay, while deferring the commit phase to a later point in time allows for more conflicts. The core idea is to relax the common practice of letting transactions access only values written by committed transactions or by commit-pending transactions that will surely commit. In our proposed solutions, we weaken this assumption while still preserving consistency according to the ACO. Depending on the desired correctness and performance level, we permit a transaction to expose its changes either:

• after it invokes the try-commit event and performs a validation to make sure its execution is consistent, while still allowing it to abort later due to an ACO violation (in OWB); or

• right after the write operation takes place during the execution, aborting any dependent transaction as soon as a further modification of the same early-exposed value happens (in OUL).

The above idea allows transactions with higher age to use such visible changes. Although it speeds up the flow of values from lower-age transactions to higher-age ones, it also creates a possible dependency chain with other live and commit-pending transactions that accessed those values. Therefore, when an abort occurs, the abort event should be immediately propagated to all the dependent transactions (cascading abort).
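A minimal sketch of cascading abort through per-transaction dependency lists (field names are ours; the rollback of exposed values is algorithm-specific and omitted):

    #include <vector>

    struct Tx {
        bool aborted = false;
        std::vector<Tx*> dependents; // transactions that read our exposed writes
    };

    void cascade_abort(Tx& t) {
        if (t.aborted) return;       // already handled on another path
        t.aborted = true;
        // roll back t's exposed writes here (algorithm-specific)
        for (Tx* d : t.dependents)
            cascade_abort(*d);       // recursively abort readers of our values
        t.dependents.clear();
    }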

Let T1 and T2 be concurrent transactions, where T1 ≺ T2, and let X be a shared object accessed by both transactions using write (Wi) and read (Ri) operations, where the subscript refers to the transaction id. Let the operator → indicate the real-time order of the operations on X. In Table 6.1, we identify three different transactional models to handle concurrent read and write operations: the classical, the cooperative ordered (our proposed model), and the conflict serialization model (DASTM [143]). The classical transactional model prohibits the co-existence of read and write operations on the same object issued by concurrent transactions, while conflict serialization permits all combinations and enforces the commit order based on the transaction dependencies. In contrast, the cooperative ordered model restricts the memory snapshot seen by current transactions to only the values exposed by older transactions. Under the conflict serialization model, transactions are aborted only when a mutual dependency (i.e., a cycle in the transaction dependency graph) exists, while in ACO the graph is necessarily acyclic.

6.4.1 Cooperative Ordered Transactional Execution

To construct our cooperative model, we start by highlighting the following two events of a transaction execution.

• A transaction becomes exposed when all its written objects have been exposed and it is in the commit-pending state with all its read operations consistent according to the ACO, i.e., no conflict with any lower-age transaction has occurred.

• A transaction becomes reachable when all the lower-age transactions have been committed.

Exposing written objects before being sure that a transaction will eventually commit may violate the ACO if all lower-age transactions are not committed yet, or when the transaction conflicts with a lower-age transaction that was not running at the time it became exposed. For this reason, we postpone releasing the transaction meta-data until the transaction becomes reachable, thus providing a safe point to decide whether a commit or abort should be triggered. The main difference between an exposed transaction and a reachable transaction is that the former, although it has already published its modifications, can still be aborted (and trigger the cascading abort of other transactions), whereas the latter cannot be aborted anymore; therefore, it is safe to release all its meta-data without violating the ACO. Figure 6.4 shows a possible execution using our transactional model. Let γ be the time required for executing the commit operation. The following expression shows the total wait time in the case of N transactions.

Σδ = α + β + max(0, N ∗ γ − (α + β) ∗ ⌈(N − n)/n⌉)
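To give a sense of the magnitudes involved, consider a purely illustrative instantiation matching the shape of Figure 6.4 (the numbers are ours): n = 7 worker threads, N = 28 transactions, and a commit that is ten times faster than executing and exposing a transaction, i.e., γ = (α + β)/10. Then ⌈(N − n)/n⌉ = 3 and Σδ = α + β + max(0, 2.8(α + β) − 3(α + β)) = α + β: the serialized commits are fully overlapped with the execution of the subsequent batches, and the total wait time collapses to the latency of a single batch.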

The ord-delay is now limited by the latency of the commit, rather than by the time required to make the transaction exposed. Note that a transaction requires less time to commit after being exposed than to become exposed after executing the try-commit (see Sections 6.5 and 7.1). This implies that the commit execution time (γ) is small, which in turn reduces the ord-delay. Besides that, threads invest a portion of the stall time executing the operations required to make the transaction exposed, which makes δ inversely related to β, unlike other approaches. Supporting this new model requires handling the following scenarios: i) aborted transactions should be able to abort other transactions that accessed their exposed updates; and ii) lower-age transactions should conflict with (and thus abort) exposed higher-age transactions. Accomplishing the above goals requires maintaining some transactional meta-data (e.g., read and write sets, including any acquired locks) even after a transaction is exposed. These meta-data help in identifying conflicts with (or aborting) exposed transactions, and they should be kept accessible until a transaction becomes reachable. Additionally, we need to support the cascading abort of multiple live or exposed transactions that share elements in their read-sets and write-sets. Consequently, transactions must be aware of their interdependencies and construct dependency relations to be able to recover from such situations.

Figure 6.4: The Execution of Ordered Transactions using our approach.

6.5 The Ordered Write Back (OWB)

The Ordered Write Back Algorithm (OWB) employs a write-buffer approach: a transaction writes its modifications into a local buffer. When entering the try-commit phase, the transaction acquires a versioned lock for each object in its write-set, writes its changes to the shared memory, and becomes exposed. To avoid concurrent writers, the locks are not released until the transaction commits or aborts. However, to allow an early propagation of the modifications, higher-age transactions can access these locked objects.

Operation   Ordering Semantics      OUL   OWB   OUL-Steal   Ordered Tx   Classical Tx   Conflict Serialization
W1 → R2     Speculative Read         X     X        X           X                                 X
W1 → W2     Steal Exclusive Lock                    X           X                                 X
R2 → W1     Invalid Read             X     X        X                                             X
W2 → W1     Overwritten Value        X     X        X                                             X
R1 → R2                              X     X        X           X            X                    X
R2 → R1                              X     X        X           X            X                    X
R1 → W2     Steal Shared Lock        X     X        X           X                                 X
W2 → R1     Invalid Read             X     X        X                                             X

Table 6.1: Handling of Read/Write operations between concurrent transactions.

In case an abort is triggered, the exposed transaction is responsible for aborting any dependent transaction that has read the exposed values. We use versioning to detect conflicts between concurrent transactions. A transaction performs a validation before exposing its values, and again before releasing its locks to reach the final commit. In practice, in OWB a transaction is exposed if: it executed until the end without any conflict with other concurrent transactions; it successfully acquired the locks on its modified objects; it exposed their new values to the shared memory; and it is waiting to become reachable. A transaction can commit only if it is reachable and passes the validation of its read operations; the transaction also releases its acquired locks at this stage. As stated earlier, an exposed transaction can still be aborted.

A transaction keeps the following meta-data: 1) the read-set, which stores the read objects and their read versions; 2) the write-set, which stores the modified objects and their new values; and 3) the dependencies list, a list of transactions that read the changes made by this transaction after it became exposed. Shared objects are associated with a versioned lock. The lock stores the version number and a reference to the writer transaction (if any) that currently owns it. The version is incremented when a new value for the object is exposed.
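For concreteness, the following C++ sketch illustrates one possible encoding of these structures; the type names and field widths are illustrative and do not come from the actual OWB implementation.

#include <atomic>
#include <cstdint>
#include <vector>

struct Transaction; // forward declaration

// Versioned lock: a version number plus a reference to the (unique)
// exposed writer, if any, that currently owns the object.
struct VersionedLock {
    std::atomic<Transaction*> writer{nullptr}; // NULL when unlocked
    std::atomic<uint32_t>     version{0};      // incremented at every expose
};

struct SharedObject {
    std::atomic<uintptr_t> value{0};
    VersionedLock          lock;
};

struct ReadEntry  { SharedObject* so; uint32_t  readVersion; }; // object + version observed
struct WriteEntry { SharedObject* so; uintptr_t newValue;    }; // holds the old value after expose

struct Transaction {
    uint64_t                  age;          // position in the ACO
    std::atomic<int>          status{0};    // ACTIVE / TRANSIENT / ABORTED / INACTIVE
    uint64_t                  lastObserved; // age of the last committed tx, read at Begin
    std::vector<ReadEntry>    readSet;
    std::vector<WriteEntry>   writeSet;
    std::vector<Transaction*> dependencies; // must allow thread-safe insertion in practice
};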

Algorithm 1 OWB - Read, Write & Validate

1: procedure Begin(Transaction tx)
2:    tx.lastObserved = lastCommitted
3: end procedure

4: procedure Read(SharedObject so, Transaction tx)
5:    readVersion = so.lock.version
6:    currentWriter = so.lock.writer
7:    if tx.writeSet.contains(so) then
8:       tx.readSet.add(so, readVersion)
9:       return tx.writeSet.get(so).value                ▷ Read written value
10:   else if currentWriter ≠ NULL then                  ▷ Check speculative write
11:      if currentWriter.age > tx.age then
12:         ABORT(currentWriter)                         ▷ W2 → R1; Read after speculative write
13:         go to 5
14:      else                                            ▷ W1 → R2; Add tx to its dependencies
15:         currentWriter.dependencies.add(tx)
16:         if currentWriter.status ≠ ACTIVE then        ▷ Double check the writer
17:            ABORT(tx)                                 ▷ Writer got aborted during registration
18:         end if
19:      end if
20:   end if
21:   if readVersion ≠ so.lock.version then
22:      go to 5
23:   end if
24:   Validate Reads(tx)
25:   tx.readSet.add(so, readVersion)
26:   return so.value
27: end procedure

28: procedure Write(SharedObject so, Object newValue, Transaction tx)
29:    tx.writeSet.add(so, newValue)                     ▷ Save new value
30: end procedure

31: procedure Validate Reads(Transaction tx)
32:    if tx.lastObserved ≠ lastCommitted then
33:       tx.lastObserved = lastCommitted
34:       for each Entry entry in tx.readSet do          ▷ Validate read-set
35:          SharedObject so = entry.so
36:          if so.lock.version ≠ entry.readVersion then
37:             return ABORT(tx)                         ▷ Read a wrong version
38:          end if
39:       end for
40:    end if
41:    return VALID
42: end procedure

43: procedure Validate Locked Reads(Transaction tx)
44:    for each Entry entry in tx.readSet do             ▷ Validate locked read-set entries
45:       SharedObject so = entry.so
46:       if so.lock.writer = tx ∧ so.lock.version ≠ 1 + entry.readVersion then
47:          return ABORT(tx)                            ▷ Concurrent expose/commit occurred
48:       end if
49:    end for
50:    return VALID
51: end procedure

Algorithm 2 OWB - Abort

52: procedure Abort(Transaction tx)
53:    if tx.status = ABORTED then return false end if      ▷ Already aborted
54:    if tx.status = INACTIVE then return false end if     ▷ Already completed
55:    while ! CAS(tx.status, ACTIVE, TRANSIENT) do         ▷ Try abort
56:       repeat until tx.status ≠ TRANSIENT                ▷ Spin wait
57:       go to 53
58:    end while
59:    for each Transaction dependency in tx.dependencies do
60:       ABORT(dependency)                                 ▷ Abort dependent transactions
61:    end for
62:    for each Entry entry in tx.writeSet do
63:       SharedObject so = entry.so
64:       if so.lock.writer = tx then                       ▷ Acquired lock
65:          so.value = entry.newValue                      ▷ Revert value
66:          so.lock.writer = NULL                          ▷ Release the lock
67:       end if
68:    end for
69:    tx.status = ABORTED
70:    return true
71: end procedure

Algorithms 1, 2 and 3 show the pseudo code of the OWB algorithm. The Write operation simply adds the object and its new value to the write-set. The Read operation first checks whether the object has been modified earlier by the transaction itself. If so, the new value from the write-set is returned; otherwise the object, along with its version, is fetched from the shared memory. If the object is currently exposed, then the writer is aborted only if its age is higher (W2 → R1), and the read operation is retried. If the transaction that holds the lock is older than the reading transaction, we let the latter read the written value (W1 → R2) and add itself to the writer's dependencies list. That way, if the writer is aborted in the future, it can cascade its abort to the affected transactions that read its modified objects (see Algorithm 2). It is worth noting that, to avoid inconsistencies while reading from an exposed writer, we let the reader double-check the writer's state (whether it has been aborted) after registering itself in the writer's dependencies list; also, the dependencies list must provide thread-safe insertion.

To enter the exposed state, a validation of the read-set is executed to make sure the transaction read a consistent view of the memory before exposing the locally buffered written objects. Validate Reads (Algorithm 1) compares the current versions of the read objects with the corresponding versions stored in the read-set. If a current version differs, then the object was modified after the read, i.e., a Write after Read (WAR) conflict. This modification could be speculative, i.e., the writer is exposed, hence R1 → W2 or R2 → W1; or valid, i.e., the writer committed. In the latter case, the committed writer must be older than the current transaction, which means that the current transaction has read an invalid value and needs to be aborted.

Upon passing the read-set validation, the Try-Commit procedure (Algorithm 3) acquires the locks and then writes the write-set to the memory. If a lock is already acquired by another concurrent exposed writer (W1 → W2 or W2 → W1), we handle that by favoring the lower-age transaction and aborting the other. Given that exposed transactions can still be aborted by other transactions, we need to store the old values of the modified objects. This is done by swapping the values stored in the write-set with the old values of the objects when the new values are exposed.

Finally, at Commit time we call Validate Reads again to prevent the WAR anomaly. However, given that write-set elements are already locked, we can leverage that to reduce the validation overhead. Consider Ti executing the commit operation, and let X ∈ read-set(Ti) ∩ write-set(Ti). As Ti still holds the locks over its write-set (including X), Ti is sure that

Algorithm 3 OWB - Commit

72: procedure TryCommit(Transaction tx)
73:    if tx.status = ABORTED then return false end if      ▷ Already aborted
74:    while ! CAS(tx.status, ACTIVE, TRANSIENT) do         ▷ Try commit
75:       repeat until tx.status ≠ TRANSIENT                ▷ Spin wait
76:       return false
77:    end while
78:    if Validate Reads(tx) ≠ VALID then return false end if
79:    for each Entry entry in tx.writeSet do               ▷ Lock write-set
80:       SharedObject so = entry.so
81:       currentWriter = so.lock.writer
82:       if currentWriter ≠ NULL then
83:          if tx.age < currentWriter.age then
84:             ABORT(currentWriter)                        ▷ W2 → W1; Write after specu. write
85:          else
86:             ABORT(tx)                                   ▷ W1 → W2; Write after write
87:             return false
88:          end if
89:       end if
90:       if ! CAS(so.lock.writer, NULL, tx) then           ▷ Acquire lock
91:          go to 81
92:       end if
93:    end for
94:    for each Entry entry in tx.writeSet do
95:       SharedObject so = entry.so
96:       so.lock.version++
97:       temp = so.value                                   ▷ Save old value
98:       so.value = entry.newValue                         ▷ Expose written value
99:       entry.newValue = temp
100:   end for
101:   if Validate Locked Reads(tx) ≠ VALID then return false end if
102:   tx.status = ACTIVE                                   ▷ Transaction exposed
103:   return true
104: end procedure

105: procedure Commit(Transaction tx)
106:   if tx.status = ABORTED then return false end if      ▷ Already aborted
107:   while ! CAS(tx.status, ACTIVE, TRANSIENT) do         ▷ Try complete
108:      repeat until tx.status ≠ TRANSIENT                ▷ Spin wait
109:      return false
110:   end while
111:   if Validate Reads(tx) ≠ VALID then return false end if
112:   for each Entry entry in tx.writeSet do
113:      SharedObject so = entry.so
114:      so.lock.writer = NULL                             ▷ Unlock
115:   end for
116:   tx.status = INACTIVE                                 ▷ Transaction committed
117:   return true
118: end procedure

105: procedure Commit(Transaction tx) 106: if tx.status = ABORTED then return false; end if . Already got aborted 107: while ! CAS(tx.status, ACTIVE, TRANSIENT) do . Try Complete 108: repeat until tx.status 6= TRANSIENT . Spin Wait 109: return false 110: end while 111: if VALIDATE READS(tx) 6= VALID then return false; end if 112: for each Entry entry in tx.writeSet do 113: SharedObject so = entry.so 114: so.lock.writer = NULL . Unlock 115: end for 116: tx.status = INACTIVE . Transaction Committed 117: return true 118: end procedure Mohamed M. Saad Chapter 6. Ordered Write Back Algorithm 101

the value of X has been unchanged since its lock acquisition, and X can therefore be excluded from the commit-time read-set validation. To do so, it is necessary to check that the read-set objects were not changed while the locks were being acquired; thus Validate Locked Reads is called right after. Keeping the commit execution time short is fundamental, as it impacts the ord-delay (as shown in Section 6.4.1); therefore, the optimization just described shrinks the commit execution at the price of an extra check in the Try-Commit procedure (i.e., the Validate Locked Reads call). However, having an object both read and written in a transaction is a common pattern, which makes this optimization fruitful. In addition, if no concurrent transaction has committed since the beginning of the execution, invoking Validate Reads and Validate Locked Reads can be avoided (for clarity, this optimization is not included in the pseudo code). To do so, the transaction stores the age of the last committed transaction at its beginning, and checks whether it has changed before the validation. Finally, when a transaction becomes reachable and the re-validation succeeds, the commit operation releases the acquired locks and reclaims any transactional meta-data.

As a correctness guarantee, OWB satisfies TMS1 [62]. The intuition is that if, in a history generated by OWB, every exposed transaction eventually commits, then the history is opaque [76]. First of all, transactions can commit only in the ACO order, serializing all the committed transactions and making OWB strict serializable. Moreover, OWB allows transactions to read only from commit-pending (exposed) and committed transactions, and any time a transaction enters the exposed state, it aborts all concurrent transactions that have read a value violating the ACO. However, an exposed transaction can abort after some live transaction has already read its values. This is not allowed by opacity, but TMS1 allows it as long as the live transactions do not perform any operation after the exposed transaction is aborted. OWB implements that through an atomic cascading abort. We give more details about correctness in Section 6.7.

6.6 Implementation

6.6.1 Lock Structure

In this section, we detail the implementation of the locks. Each memory address is associated with a 32-bit lock. The mapping between addresses and locks is made by leveraging the least significant bits, so a single lock is able to cover multiple addresses. The lock is divided into two sections: the most significant N bits represent the reference to the writer, and the remaining bits represent the version number.
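As an illustration, assuming N = 8 bits for the writer field (so writers are identified by an index into a transaction table rather than a raw pointer), the packing and unpacking can be sketched in C++ as follows; all names and widths here are ours, not the actual implementation's:

#include <cstdint>

constexpr uint32_t WRITER_BITS  = 8;                       // assumed width of the writer field
constexpr uint32_t VERSION_BITS = 32 - WRITER_BITS;        // remaining bits hold the version
constexpr uint32_t VERSION_MASK = (1u << VERSION_BITS) - 1;

// Pack a writer index (0 = no writer) and a version into one 32-bit lock word.
inline uint32_t pack_lock(uint32_t writerIdx, uint32_t version) {
    return (writerIdx << VERSION_BITS) | (version & VERSION_MASK);
}

inline uint32_t lock_writer(uint32_t lockWord)  { return lockWord >> VERSION_BITS; }
inline uint32_t lock_version(uint32_t lockWord) { return lockWord & VERSION_MASK; }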

Algorithm 4 Thread Execution

1: for each Transaction tx in WorkQueue do
2:    if validator = IDLE ∧ CAS(validator, IDLE, BUSY) then     ▷ Try to become the validator
3:       tx = ExposedList[last_committed]
4:       if tx = NULL then                                      ▷ Tx is not exposed yet
5:          go to 16                                            ▷ Stop validation
6:       end if
7:       if tx.commit() = FAIL then                             ▷ Perform Tx commit
8:          tx.start()
9:          tx.execute()                                        ▷ Re-execute failed transaction
10:         tx.tryCommit()                                      ▷ Commit without validation
11:         tx.commit()
12:      end if
13:      last_committed++
14:      CommittedQueue.enqueue(tx)
15:      go to 3                                                ▷ Validate next exposed Tx
16:      validator = IDLE                                       ▷ Release the validator role
17:   end if
18:   if aborts > LIMIT ∨ tx.age − last_committed > MAX then
19:      while tx.age − last_committed > MIN do
20:         for each Transaction tx in CommittedQueue do
21:            tx.clean()                                       ▷ Do housekeeping
22:         end for
23:      end while
24:   end if
25:   tx.start()
26:   tx.execute()                                              ▷ Execute transaction
27:   if tx.tryCommit() = FAIL then                             ▷ Try to expose the transaction
28:      go to 25                                               ▷ Retry
29:   end if
30:   ExposedList[tx.age] = tx                                  ▷ Add to pending transactions
31: end for

6.6.2 Thread Execution

In our implementation, a single executing thread plays multiple roles: worker, validator, or cleaner. Switching between the different roles is done according to the procedure in Algorithm 4. A worker thread executes the transaction (creation, reads, and writes) and performs the try-commit, while a thread in the cleaner role takes care of the housekeeping (reclaiming transactional meta-data). There is a single thread at a time in the validator role, which is responsible for moving commit-pending (exposed) transactions to the committed state and for re-executing invalid transactions. We adopt the flat combining [87] technique to enable threads to take the validator role, as shown in Algorithm 4. In OWB, exposed transactions hold locks on their write-sets, which limits concurrency and increases the conflict probability. For this reason, it is highly desirable for a transaction to spend only a short time between the exposed and committed states. Therefore, the worker threads should avoid overwhelming the validator by executing (and exposing) too many transactions. In line 18, if the count of pending exposed transactions exceeds a threshold or the number of aborts increases, a worker thread spends its cycles cleaning the committed transactions (i.e., it switches its role from worker to cleaner). It continues cleaning until a minimum threshold

for the count of those pending transactions is reached (line 19).

6.7 Correctness

In this section, we discuss the correctness guarantees of the given algorithms. First, we show how OWB preserves the ACO. Suppose by contradiction that the ACO is violated. Let Ti and Tj be two transactions such that Ti ≺ Tj. The interesting case is when Ti successfully reads a value of an object X written by Tj. This implies that Ri(X) happened after Tj exposed X's value. In OWB, Ti acquires a shared lock on X at the time of the read operation, by checking that there is no writer. In order to have a successful read, the shared lock must be acquired, thus the write lock must not be already granted. This implies that Tj has released all its locks. As a transaction does not release its acquired locks until it commits, Tj must necessarily be committed. Therefore Ri(X) must occur after commit(Tj). Since a transaction cannot perform any step after it commits, Ri(X) → commit(Ti). This means commit(Tj) → commit(Ti), which cannot be the case since transactions must commit in order, according to their ages.

Now we prove that OWB is serializable. To do so, we define DG(i, j) as a predicate capturing a dependency between Ti and Tj: Tj reads a value written by Ti, or Tj overwrites a value written by Ti. Using this definition, we can construct a directed dependency graph DG(T, D), where T is the set of all committed transactions and D is the set of dependency relations. It is easy to see that DG ⊆ SG, where SG is the conflict serialization graph [21]. A history is serializable if and only if its SG is acyclic; note that serializability is not guaranteed merely because DG is acyclic. Assume by contradiction that an execution of our algorithms produces a cyclic DG, which implies the existence of an edge D(i, j) where i > j. By the definition of dependency, this means that either Tj reads a value written by Ti (i.e., Wi(X) → Rj(X)), or Tj overwrites a value written by Ti (i.e., Wi(X) → Wj(X)). In all the proposed algorithms, exclusive locks must be acquired when the written values are exposed, and they are released only at commit, or passed to a higher-age transaction (which is not the case here). We can thus rewrite the previous situations as commit(Ti) → Rj(X) or commit(Ti) → Wj(X). Since a transaction cannot perform any step after it commits, commit(Ti) → commit(Tj), which cannot be the case as mentioned earlier; therefore, DG is acyclic. Now assume e ∈ E = SG \ DG; such an edge represents the case where Ri(X) → Wj(X), which means Ri(X) → commit(Tj). In OWB, the procedure Validate Reads captures this by comparing the read version with the current version of the accessed object (Algorithm 1). So E = ∅ ⇒ SG = DG ⇒ SG is acyclic, making the algorithm serializable. Note that the serialization point for OWB is line 116 in Algorithm 3. As the serialization point is inside the transaction execution, the algorithm preserves the real-time order and is strict serializable.

In addition to being strict serializable, OWB is also TMS1 [62], a stronger condition than strict serializability.

T1: W(x, 1) C

T2: R(x, 0) W(z, 1) A

T3: R(y, 0) R(z, 1) A

Figure 6.5: An execution of OWB, which is TMS1. The initial value of all the shared variables is 0. The ACO is T1 ≺ T2 ≺ T3.

This means that every history produced by OWB is TMS1. Being TMS1, OWB ensures that the response of every object operation, even by aborted and live transactions, is consistent with a serial execution. Informally, for a history to be TMS1, it must be strict serializable, and for every successful response of an object operation by a transaction T, there must exist a serialization of a subset of the transactions justifying the response. This subset must contain T (up to the response) and all the committed transactions that completed before T started. In addition, the serialization can also contain some commit-pending transactions, and some committed and even aborted transactions that are concurrent with T. We have already shown that OWB is strict serializable. Since OWB allows a read operation to return a value written by an exposed transaction, which may get aborted later, justifying the response may require including a concurrent aborted transaction. Recall that OWB allows reading values written by committed and exposed transactions only, but never from aborted transactions. The intuition is that if a transaction reads from an exposed transaction which gets aborted later, the reading transaction is also aborted without executing any further operation; this is done using the cascading abort mechanism of OWB (line 60 in Algorithm 2). Figure 6.5 shows an execution of OWB: transaction T3 reads the value of z written by the exposed transaction T2, which gets aborted later due to the commit of T1, which writes a new value of x.

6.8 Evaluation

We implemented OWB, the ordered versions of four existing well-known TM designs (i.e., TL2 [59], NOrec [53], and UndoLog [69] with and without visible readers), and STMLite [120], and compared their performance using a set of micro-benchmarks and STAMP [37], where an order has been enforced. The evaluation of OWB is postponed to Section 7.4, after the OUL algorithm is introduced in the next chapter.

Chapter 7

Ordered Undolog Algorithm

The Ordered Undo Log (OUL) Algorithm is an undo-log algorithm that preserves the ACO. Here, transactional updates affect the shared memory at encounter time, while the old value is kept in a local undo-log. Such a scheme implies that the transactions' order is guaranteed while operations are invoked, and not at commit time as in OWB. To deploy the above idea, each object is associated with a read-write lock. A transaction acquires a read or write lock according to its needs, as explained later. Each lock stores a reference to the (single) writer transaction, which can be either the current transaction holding the lock or the one that committed the current version, and a list of concurrent readers, namely those transactions that accessed the version for reading and are still live or commit-pending. Note that the size of the readers list impacts the efficiency of the protocol, thus it should be bounded. As in OWB, every transaction in OUL maintains a write-set, but here the write-set stores the old values of the written objects (the undo-log). The transaction read-set is implicitly represented by the readers lists of the object locks.
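A possible C++ shape of this lock is sketched below; the names are ours, and the bound of 10 visible readers anticipates the configuration reported in Section 7.2.1.

#include <atomic>

constexpr int MAX_READERS = 10; // bounded visible-readers list (see Section 7.2.1)

struct Transaction;

// Read-write lock with a single writer reference and visible readers.
// The writer is either the live owner or the transaction that committed
// the current version; readers register themselves in a bounded array,
// which doubles as the (implicit) read-set.
struct OULLock {
    std::atomic<Transaction*> writer{nullptr};
    std::atomic<Transaction*> readers[MAX_READERS]{};
};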

7.1 Ordered Undolog Algorithm (OUL)

In OUL, we use an immediate update approach with undo logging for recovery from aborts. Accessed objects are locked at encounter time using a variant of read-write locks. Our lock permits a single writer to co-exist with multiple readers under certain conditions. The lock keeps a bounded list of the reader transactions (visible readers) and a reference to the writer transaction; hence, dependencies between transactions are preserved through the read-write locks. Algorithms 5 and 6 show the OUL core operations. (We omit the necessary FENCE instructions that prevent out-of-order execution; they can be found in the source code at: https://bitbucket.org/mohamed-m-saad/ordertm.) In the Read, we allow Read after Write


(RAW) conflicts only if the writer transaction has a lower age (W1 → R2); otherwise, the speculative writer is aborted (W2 → R1). The Write operation enforces that only a single transaction can hold the write lock on an object at a time. A Write-Write conflict is solved by aborting the transaction with the higher age. As readers are visible, the writer transaction can check whether there is any (wrong) speculative reader, and abort it accordingly (R2 → W1).

Algorithm 5 OUL - Read & Write

1: procedure Read(SharedObject so, Transaction tx)
2:    if tx.status = TRANSIENT then return ABORT(tx) end if
3:    Transaction currentWriter = so.lock.writer
4:    if currentWriter = BUSY then go to 2 end if
5:    if currentWriter ≠ NULL ∧ currentWriter.status ≠ INACTIVE ∧ currentWriter.age > tx.age then
6:       ABORT(currentWriter)                                   ▷ W2 → R1; Read after specu. write
7:       go to 2
8:    end if
9:    registered = false
10:   repeat
11:      for i = 0 to MAX_READERS do
12:         Transaction readerSlot = so.lock.reader[i]
13:         if readerSlot ≠ ACTIVE ∧ readerSlot ≠ PENDING ∧ CAS(so.lock.reader[i], readerSlot, tx) then
14:            registered = true                                ▷ Found an empty reader slot
15:         end if
16:      end for
17:   until registered
18:   if currentWriter ≠ so.lock.writer then                    ▷ Writer was changed meanwhile
19:      go to 2
20:   end if
21:   return so.value
22: end procedure

23: procedure Write(SharedObject so, Object newValue, Transaction tx)
24:    if tx.status = TRANSIENT then ABORT(tx) end if
25:    Transaction currentWriter = so.lock.writer
26:    if currentWriter = BUSY then go to 24 end if
27:    if currentWriter ≠ tx then                               ▷ Skip if so is already in the write-set
28:       if currentWriter ≠ NULL ∧ currentWriter.status ≠ INACTIVE then
29:          if currentWriter.age > tx.age then
30:             ABORT(currentWriter)                            ▷ W2 → W1; Write after specu. write
31:             go to 24
32:          else
33:             ABORT(tx)                                       ▷ W1 → W2; Write after write
34:          end if
35:       end if
36:       if ! CAS(so.lock.writer, currentWriter, BUSY) then
37:          go to 24                                           ▷ Failed to acquire the lock
38:       end if
39:       tx.writeSet.add(so, so.value)                         ▷ Save old value
40:    end if
41:    for i = 0 to MAX_READERS do
42:       Transaction readerSlot = so.lock.reader[i]
43:       if readerSlot ≠ INACTIVE ∧ readerSlot.age > tx.age then
44:          ABORT(readerSlot)                                  ▷ R2 → W1; Abort specu. reader
45:       end if
46:    end for
47:    so.value = newValue                                      ▷ Write new value
48:    so.lock.writer = tx                                      ▷ Record tx as the new writer
49: end procedure


Figure 7.1: OUL Transaction States.

One of the major benefits of a write-through protocol is that the Try-Commit procedure is simple, because the values are already in the shared memory. However, in OUL, exposing a transaction only means that it has not conflicted with other transactions so far; it can still be aborted to preserve the ACO. In the Commit procedure, the transaction is marked as Inactive and its locks are released. As said before, given that a lock maintains a back-reference to the transaction that holds it, setting the transaction status is sufficient to release all the locks held by that transaction in a single step. On the other hand, in Abort the transaction has to restore the old values from the undo-log (Rollback) and release all the locks (switching to the Inactive state).
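In other words, a lock is logically free as soon as its owner's status says so; no per-lock store is needed at commit. A minimal C++ sketch of this check, under our own naming of the status encoding (cf. Figure 7.1):

#include <atomic>

enum TxStatus { ACTIVE, PENDING, TRANSIENT, INACTIVE }; // cf. Figure 7.1

struct Transaction { std::atomic<TxStatus> status{ACTIVE}; /* age, sets, ... omitted */ };

struct OULLock { std::atomic<Transaction*> writer{nullptr}; /* readers omitted */ };

// A lock is held only while its back-referenced writer is live or
// exposed; flipping the writer's status to INACTIVE therefore releases
// every lock it holds with a single store.
inline bool is_write_locked(const OULLock& lock) {
    Transaction* w = lock.writer.load();
    return w != nullptr && w->status.load() != INACTIVE;
}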

Algorithm 6 OUL - Commit & Abort

50: procedure TryCommit(Transaction tx)
51:    if ! CAS(tx.status, ACTIVE, PENDING) then ABORT(tx) end if
52: end procedure

53: procedure Commit(Transaction tx)
54:    if CAS(tx.status, PENDING, INACTIVE) then return true end if
55:    repeat until tx.status ≠ TRANSIENT                       ▷ Wait until aborted
56: end procedure

57: procedure Abort(Transaction tx)
58:    if tx.status = INACTIVE then return true end if          ▷ Check if already aborted
59:    if CAS(tx.status, PENDING, TRANSIENT) then               ▷ Rollback
60:       for each Entry entry in tx.writeSet do
61:          SharedObject so = entry.so
62:          Object value = entry.value
63:          so.value = value                                   ▷ Restore old value
64:          for i = 0 to MAX_READERS do
65:             Transaction readerSlot = so.lock.reader[i]
66:             if readerSlot ≠ INACTIVE ∧ readerSlot.age > tx.age then
67:                ABORT(readerSlot)                            ▷ Abort specu. reader
68:             end if
69:          end for
70:       end for
71:       tx.status = INACTIVE
72:    else                                                     ▷ Set aborted
73:       return CAS(tx.status, ACTIVE, TRANSIENT)
74:    end if
75: end procedure

Figure 7.1 recaps the different states of OUL transactions and their transitions. The Active, Pending and Inactive states correspond to the live, exposed, and committed (or aborted) situations, respectively, while the Transient state indicates the to-be-aborted condition, which can be triggered by another conflicting transaction.
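Using the TxStatus and Transaction sketches given above, the transitions of Figure 7.1 map directly onto CAS operations; the following is our sketch, not the actual OUL source:

#include <atomic>

// ACTIVE --try-commit--> PENDING (the transaction becomes exposed);
// fails if a concurrent abort request raced us.
inline bool try_expose(Transaction& tx) {
    TxStatus expected = ACTIVE;
    return tx.status.compare_exchange_strong(expected, PENDING);
}

// PENDING --abort--> TRANSIENT; the aborter then performs the rollback
// and finally publishes INACTIVE.
inline bool request_abort(Transaction& tx) {
    TxStatus expected = PENDING;
    return tx.status.compare_exchange_strong(expected, TRANSIENT);
}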

7.1.1 The OUL-Steal Algorithm

In this section, we introduce OUL-Steal, a variant of the OUL algorithm in which we relax the aforementioned single-writer restriction and allow write-write conflicts while still guaranteeing the ACO. In both OWB and OUL, conflicting transactions cooperate to commit: they are allowed to proceed without aborts even in the presence of some read-write conflicts, as long as the ACO is preserved. However, a writer transaction holds its locks until reaching the commit state, which sometimes limits the overall concurrency.

Let Ti and Tj be two conflicting writers on an object X, with Ti ≺ Tj. In OUL, if Ti finds X already locked by Tj (W2 → W1), Ti aborts Tj. However, the ACO could still be preserved if Tj overwrote the value of Ti, as long as there is no other transaction Tk, with Ti ≺ Tk ≺ Tj, that will read X in the future. OUL-Steal allows a transaction with higher age to overwrite the value written by a concurrent transaction with lower age (W1 → W2) and steal its lock. The newer transaction stores the stolen lock in a local list so that it can be returned to the original writer (the transaction with the lower age) in case of abort. That way, if a mid-age reader Tk needs the value of an older transaction, it can abort the newer transaction(s) which stole the lock(s); otherwise (i.e., without Tk), the value written by the newer transaction will be used by the higher-age readers. This operation can be repeated at multiple levels until the reader reaches the correct writer transaction. More formally, ∀i > k, if X ∈ read-set(Tk) and Ti ∈ writers(X), then Tk aborts Ti.

In Write, the lock is passed to the higher-age writer and is saved in its write-set. As a consequence, the written address exists in the undo-logs of both writers (the original one and the one that stole the lock). During the Abort, the transaction uses Rollback to revert its changes using its undo-log. An undo-log entry can be:

• stolen by another writer: the transaction does not hold the ownership record at abort time. In this case, the transaction takes no action, although it keeps the undo-log entry, which contains the address value prior to the current transaction's modification.

• exclusively modified by the current transaction: the old value is reverted from the undo-log, and the speculative readers are aborted.

• stolen from another writer: in addition to the steps of the exclusively modified case, the lock ownership is passed back to the old writer, and the current transaction checks the state of the old writer. If it was not aborted, then no further action is needed; otherwise, the transaction calls the Rollback of the old owner. At this stage, the old writer will treat the entry as either exclusively modified or stolen from another writer, accordingly.

The complete pseudo code of the OUL-Steal algorithm is shown in Algorithms 7 and 8.
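To illustrate with a hypothetical run, let T1 ≺ T2 ≺ T3, and let T1 write X first. If T3 then writes X, T3 steals X's lock, records T1 as the original owner in its write-set, and overwrites the value. If T2 never accesses X, the execution is correct: readers with age higher than T3 observe T3's value, and commits still happen in age order. If instead T2 issues a read of X, T2 must observe T1's value, so T2 aborts T3; T3's Rollback restores T1's value from its undo-log and returns the lock to T1 (recursively re-invoking T1's Rollback if T1 itself was aborted), after which T2 can safely read.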

Algorithm 8 OUL-Steal - Abort

58: procedure Rollback(Transaction tx)
59:    tx.aborted = true
60:    for each Entry entry in tx.writeSet do
61:       SharedObject so = entry.so
62:       if CAS(so.lock.writer, tx, BUSY) then
63:          Object value = entry.value
64:          so.value = value                                   ▷ Restore old value
65:          so.lock.writer = entry.originalOwner               ▷ Release the lock, or return it to the original owner
66:          if entry.originalOwner ≠ NULL ∧ entry.originalOwner.aborted then
67:             ROLLBACK(entry.originalOwner)
68:          end if
69:       end if
70:       for i = 0 to MAX_READERS do
71:          Transaction readerSlot = so.lock.reader[i]
72:          if readerSlot ≠ INACTIVE ∧ readerSlot.age > tx.age then
73:             ABORT(readerSlot)                               ▷ Abort speculative readers
74:          end if
75:       end for
76:    end for
77: end procedure

78: procedure Abort(Transaction tx)
79:    if tx.status = INACTIVE then return true end if          ▷ Already aborted
80:    if CAS(tx.status, PENDING, TRANSIENT) then
81:       Rollback(tx)                                          ▷ Rollback
82:       tx.status = INACTIVE
83:    else
84:       if CAS(tx.status, ACTIVE, TRANSIENT) then             ▷ Set aborted
85:          return true
86:       end if
87:    end if
88:    return false                                             ▷ Failed to abort
89: end procedure

In Write, the lock is passed to the higher-age writer and is saved in its write-set (Lines 34 and 40). The Abort procedure checks whether the current transaction still holds the lock. If so, in addition to reverting the old values of the written objects and aborting the speculative readers, the lock is passed back to the old owner (if one exists). If the lock has been stolen by some other transaction, that transaction is now in charge of restoring the value by calling Rollback recursively over the chain of previously aborted owner transactions, if any (Line 67). However, if the original owner of the lock is still alive (or committed), then it is sufficient for the current owner to just revert the newly written value of the object from the undo-log.

Algorithm 7 OUL-Steal - pseudo code

1: procedure Read(SharedObject so, Transaction tx)
2:    if tx.status = TRANSIENT then return ABORT(tx) end if
3:    Transaction currentWriter = so.lock.writer
4:    if currentWriter = BUSY then go to 2 end if
5:    if currentWriter ≠ NULL ∧ currentWriter.status ≠ INACTIVE ∧ currentWriter.age > tx.age then
6:       ABORT(currentWriter)                                   ▷ W2 → R1; Read after speculative write
7:       go to 2
8:    end if
9:    registered = false
10:   repeat
11:      for i = 0 to MAX_READERS do
12:         Transaction readerSlot = so.lock.reader[i]
13:         if readerSlot ≠ ACTIVE ∧ readerSlot ≠ PENDING ∧ CAS(so.lock.reader[i], readerSlot, tx) then
14:            registered = true                                ▷ Found an empty reader slot
15:         end if
16:      end for
17:   until registered
18:   if currentWriter ≠ so.lock.writer then                    ▷ Writer changed meanwhile
19:      go to 2
20:   end if
21:   return so.value
22: end procedure

23: procedure Write(SharedObject so, Object newValue, Transaction tx)
24:    if tx.status = TRANSIENT then ABORT(tx) end if
25:    Transaction currentWriter = so.lock.writer
26:    if currentWriter = BUSY then go to 24 end if
27:    if currentWriter ≠ tx then                               ▷ Skip if so is already in the write-set
28:       steal = false
29:       if currentWriter ≠ NULL ∧ currentWriter.status ≠ INACTIVE then
30:          if currentWriter.age > tx.age then
31:             ABORT(currentWriter)                            ▷ W2 → W1; Write after specu. write
32:             go to 24
33:          else
34:             steal = true                                    ▷ W1 → W2; Lock steal, Write after write
35:          end if
36:       end if
37:       if ! CAS(so.lock.writer, currentWriter, BUSY) then    ▷ Acquire the lock
38:          go to 24
39:       end if
40:       tx.writeSet.add(so, so.value, steal ? currentWriter : NULL)   ▷ Save old value, and the old writer when stealing the lock
41:    end if
42:    for i = 0 to MAX_READERS do
43:       Transaction readerSlot = so.lock.reader[i]
44:       if readerSlot ≠ INACTIVE ∧ readerSlot.age > tx.age then
45:          ABORT(readerSlot)                                  ▷ R2 → W1; Abort speculative readers
46:       end if
47:    end for
48:    so.value = newValue                                      ▷ Write new value
49:    so.lock.writer = tx                                      ▷ Record tx as the new writer
50: end procedure

51: procedure TryCommit(Transaction tx)
52:    if ! CAS(tx.status, ACTIVE, PENDING) then ABORT(tx) end if
53: end procedure

54: procedure Commit(Transaction tx)
55:    if CAS(tx.status, PENDING, INACTIVE) then return true end if
56:    repeat until tx.status ≠ TRANSIENT                       ▷ Wait until aborted
57: end procedure

7.2 Implementation

7.2.1 Lock Structure

Each memory address is associated with a 32-bit lock. The mapping between addresses and locks is made by leveraging the least significant bits, so a single lock is able to cover multiple addresses. The lock is divided into two sections: the most significant N bits represent the reference to the writer, and the remaining bits represent the header address of the readers list. We use a bounded list of readers to limit the number of concurrent readers, which is set to 10 transactions in our experiments.
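Reusing the OULLock sketch from Section 7.1, the address-to-lock mapping (used by both OWB and OUL) can be illustrated as a fixed lock table indexed by the least significant bits of the address; the table size and grain are assumptions of ours:

#include <cstddef>
#include <cstdint>

constexpr size_t    LOCK_TABLE_SIZE = 1 << 20;   // assumed; must be a power of two
constexpr uintptr_t GRAIN_SHIFT     = 3;         // one lock per 8-byte slot (assumed)

static OULLock lockTable[LOCK_TABLE_SIZE];       // OULLock as sketched in Section 7.1

inline OULLock& lock_for(const void* addr) {
    uintptr_t a = reinterpret_cast<uintptr_t>(addr) >> GRAIN_SHIFT;
    return lockTable[a & (LOCK_TABLE_SIZE - 1)]; // distinct addresses may share a lock
}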

7.2.2 Thread Execution

Threads execute similarly to the mechanism discussed in Section 6.6.2. However, the situation is different in OUL (and its variant): since it uses encounter-time locking, the duration of the period in which a transaction sits between the exposed and committed states does not affect performance. Nevertheless, moving a transaction to the committed state releases its locks earlier, hence reducing the conflict probability.

7.3 Correctness

As a correctness guarantee, OUL provides Strict Serializability [139]. Unlike OWB, OUL allows reading from live transactions, which is not allowed by TMS1 (and hence by opacity). However, similar to OWB, OUL restricts transactions to commit only in the ACO order, making OUL strict serializable. To show how OUL and OUL-Steal preserve the ACO, we borrow the argument used in Section 6.7. Suppose by contradiction that the ACO is violated. Let Ti and Tj be two transactions such that Ti ≺ Tj. The interesting case is when Ti successfully reads a value of an object X written by Tj. This implies that Ri(X) happened after write(Tj) in OUL. In both OUL and OUL-Steal, Ti acquires a shared lock on X at the time of the read operation using visible reads. In order to have a successful read, the shared lock must be acquired, thus the write lock must not be already granted. This implies that Tj has released all its locks. As a transaction does not release its acquired locks until it commits, Tj must necessarily be committed. Therefore Ri(X) must occur after commit(Tj). Since a transaction cannot perform any step after it commits, Ri(X) → commit(Ti). This means commit(Tj) → commit(Ti), which cannot be the case since transactions must commit in order, according to their ages.

The serializability guarantee for both OUL and OUL-Steal is proved using the same technique as in Section 6.7 (i.e., using the conflict serialization graph). Assume e ∈ E = SG \ DG; such an edge represents the case where Ri(X) → Wj(X), which means Ri(X) → commit(Tj). In both OUL and OUL-Steal, the readers' visibility enables Tj to detect Ri(X) and abort Ti (line 43 in Algorithm 5). So E = ∅ ⇒ SG = DG ⇒ SG is acyclic, making the algorithms serializable. Note that the serialization point for OUL is line 54 in Algorithm 6. As the serialization point is inside the transaction execution, both algorithms preserve the real-time order and are strict serializable.

Opacity [76] is an important correctness property for TM implementations; however, given the fundamental characteristic of our proposals of letting threads cooperate before transactions are actually completed, we cannot claim to ensure opacity. More specifically, the two OUL algorithms forward written values at encounter time, thus they allow doomed transactions to observe inconsistent state.

7.4 Evaluation

We compare our algorithms with STMLite [120], a lightweight STM with ACO support used for code parallelization, and with ordered versions of three state-of-the-art TM algorithms: TL2 [59], NOrec [53] and UndoLog [69] (with and without visible readers). Both TL2 and NOrec follow the write-back design strategy and validate transactions at commit time. Given that, ordering commit operations according to ages guarantees that transactions see a consistent view of the memory before modifying the shared state. To aid the ordering for UndoLog, we exploit an age-based contention policy (i.e., always favor the transaction with the lower age) to handle write-write conflicts. In the visible-readers variant, the writer transaction aborts all active readers, while when readers are invisible the writer retries multiple times if the object is locked, and then backs off.

STMLite uses a write-back implementation and replaces the need for constructing a read-set by leveraging signatures (Bloom filters). There is a tradeoff in determining the effective size of signatures, but the authors recommend a range of 32 to 1024. In our experiments, we used a signature of size 64 with the STL hashing function because it provided the best performance. The number of threads in STMLite also includes its commit manager. All competitors, including STMLite, have been re-implemented using the same baseline software framework, so that all take advantage of the same low-level optimizations.

We conducted the experiments on an AMD machine equipped with 2 Opteron 6168 CPUs, each with 12 cores running at 1.9 GHz. The total memory available is 12 GB and the cache sizes are 128 KB for the L1, 512 KB for the L2, and 12 MB for the L3. Results are the average of five runs. We report the throughput for the micro benchmarks and the application execution time for STAMP, varying the number of threads used (the datapoint at 1 thread shows the performance of the single-threaded transactional execution). For completeness, the performance of the non-transactional single-threaded execution (green line) has also been included. In all the benchmarks, experiments are conducted by parallelizing the main for-loop that activates all transactions. The ACO is defined according to the index of the loop; in other words, the first iteration of the loop activates the transaction that must commit first, and so on. Note that this technique is the standard way to parallelize sequential code (e.g., [120]). Also, it does not introduce any additional contention on the benchmark itself.

7.4.1 Micro Benchmark

In our first set of experiments, we consider the RSTM micro-benchmarks [2] to evaluate the effect of different workload characteristics, such as the number of operations per transaction, the transaction length, and the read/write ratio, on the performance. Each experiment involved running half a million transactions. For all micro benchmarks, we configured three types of transactions: short, long, and heavy. Both short and heavy have the same number of accesses, but the latter adds more local computation in between them; such a workload is representative of workloads produced by parallelization frameworks. Long transactions simply produce more transactional accesses.

Figure 7.2 summarizes the peak performance of all competitors. From it we can see the gap in performance between the ordered and unordered versions of the same algorithm: 26-56% for TL2, 13-41% for NOrec, 12-88% for UL-vis, and 28-74% for UL-invis. As a general comment on the results, OUL and OUL-Steal outperform all other ordered versions of the algorithms. OUL-Steal excels under write loads and performs equally to OUL under read loads; OWB outperforms all write-back based implementations in most benchmarks. At high thread counts, STMLite suffers from false conflicts due to the use of signatures. However, at low numbers of threads (fewer than 8) and with Long transactions, it achieves a higher peak throughput than Ordered TL2 and Ordered NOrec, because it benefits from the quick validation using signatures. For the UL-inv algorithm, we found that the readers' visibility is crucial; without this information, the algorithm may abort a lower-age transaction (using a timeout) while some higher-age transaction holds the read shared lock. On the other hand, these higher-age transactions cannot commit before their turn comes, hence they time out.

In configurations where the performance of the sequential (non-transactional) execution is faster than many ordered algorithms, our solutions outperform it, letting parallelism pay off. However, there are two benchmarks configured with long transactions where the sequential execution is faster. These workloads are not suited to be parallelized while providing a commit order because of the high abort cost, which is more impactful given the chance of encountering context switches, threading overhead, and cache thrashing.

The DisjointBench (Figures 7.3a-7.3c) produces a workload with no conflicts between concurrent transactions. Every transaction accesses a different set of addresses with read and write operations.

Figure 7.2: Peak performance of all competitors (including unordered) using all micro benchmarks (Y-axis is log scale).

Figure 7.3: Disjoint Benchmark.

Figure 7.4: ReadNWrite1 Benchmark.

Figure 7.5: ReadWriteN Benchmark.

Figure 7.6: MCAS Benchmark.

In all configurations, OUL achieves the best throughput, while OUL-Steal suffers from the overhead of its lock management scheme without actually gaining from it, as the disjoint transactions do not have any shared accesses. UL-vis achieves a throughput near OUL-Steal, thanks to the simplicity of its immediate write strategy. Given the absence of aborts, we can isolate the transactional access overhead of each of these algorithms. Intuitively, the UndoLog algorithms (including UL-vis, UL-inv, OUL, and OUL-Steal) benefit from having the values already in memory, thus they outperform the others; in fact, the UndoLog's main drawback is the costly abort, which never happens in this benchmark. With Long transactions (Figure 7.3a), STMLite benefits from eliminating lock usage at the write-back phase and has minimal overhead at low numbers of threads. On the other hand, for Short transactions (Figures 7.3b and 7.3c) the Ordered TL2 algorithm performs better. OWB has a moderate overhead relative to the other write-back algorithms.

In ReadNWrite1Bench (Figures 7.4a-7.4f), the transaction write-set is very small, which implies a low number of aborts. The UndoLog algorithms excel here as well. With long and heavy transactions (Figures 7.4a and 7.4e), the processing done by workers is balanced with the validator overhead, so both OUL and OUL-Steal scale well with an increasing number of workers. On the other hand, the validator represents a performance bottleneck for short transactions (Figure 7.4c), resulting in slightly lower scalability.

In ReadWriteN (Figures 7.5a-7.5f), the large transaction write-set introduces a challenge for both undo-log algorithms (it increases the number of aborts) and write-buffer algorithms (it delays the commit). The cooperative execution enables OUL, OUL-Steal and OWB to outperform all other algorithms in all workloads. OUL-Steal outperforms OUL by 10% because it significantly reduces the number of aborts (Figures 7.5b, 7.5d, and 7.5f).

Similar to ReadWriteN, MCASBench has a large write-set, but the abort probability is lower than before because each pair of read/write operations acts on the same location. Figures 7.6a-7.6f illustrate the impact of increasing the number of workers under the different workloads; we noticed a trend similar to ReadWriteN.

Analyzing the Abort Reasons and Breakdown

The breakdown of the abort reasons for OWB, OUL, and OUL-Steal is shown in Figure 7.7. Aborts are measured for the number of workers that achieved the maximum throughput. In OWB (Figure 7.7a), with RNW1Bench, aborts due to validation failures are the main component, while in write-intensive benchmarks, such as RWNBench and MCASBench, aborts are mainly (65%-82%) due to concurrent commits (Locked Write). However, only 3% of these cases fall in WAW, which means that OWB could benefit from the lock-steal optimization and save a considerable number of aborts. Nevertheless, applying lock-steal to OWB would complicate the design and the validation procedure: transactions use commit-time locking and rely on the version number to validate their read-set, and with lock-steal multiple writers would increment the version number, so readers would no longer be able to validate simply.

Figure 7.7: Aborts Breakdown.

For OUL in write-intensive benchmarks, concurrent writes generate between 70% and 85% of the total aborts; WAW represents at most 10% of them. In OUL-Steal, stealing the lock eliminates the problem of concurrent writes and narrows write-write conflicts to only the WAW anomaly. However, it introduces several changes to the abort characteristics: a writer transaction that steals a lock becomes able to abort invalid speculative readers earlier than before, which is reflected in an increased number of Read After Write aborts; the probability of triggering cascading aborts increases compared to OUL (Figures 7.7b and 7.7c); and the total number of aborts of OUL is reduced by one order of magnitude (Figures 7.4b, 7.4d, 7.5b, 7.5d, 7.6b, 7.6d, and 7.7d).

Although OUL-Steal substantially reduces the number of aborts, the speed-up is on average 20%. The reasons are twofold: the abort procedure of OUL-Steal is longer than OUL's (2-4× in our experiments) because it involves a recursive rollback for stolen locks, which outweighs the reduction in the number of aborts; and OUL uses encounter-time locking, so aborts are detected at an early stage, which reduces the impact of aborting. In contrast, lazy algorithms (e.g., OWB) are greatly affected by aborts because the whole transaction needs to be re-executed, given that the invalidation is detected at commit time. It is worth noting that in the OUL algorithms the abort cost differs according to the transaction type: aborting a write transaction requires restoring the original values, forcing the other transaction involved in the conflict to wait for the restoration of the old written values, whereas aborting a reader is cheaper. Figure 7.7d shows the total number of aborts in the maximum throughput scenario. OUL experiences more aborts than OWB because of its eager accesses, while OUL-Steal avoids this drawback and experiences fewer, yet longer, aborts.

7.4.2 STAMP Benchmark

Stanford Transactional Applications for Multi-Processing (STAMP) [37] is a benchmark suite with applications covering a variety of domains (see Section 5.9.2). Figures 7.8 and 7.9 show the execution time of the aforementioned algorithms (lower is better). Two applications (Yada and Bayes) have been excluded because they expose non-deterministic behaviors, thus their evolution is unpredictable. The datapoints of competitors that do not scale in some configuration are omitted to preserve the scale and readability of the plots. For completeness, we also include the performance of the unordered STM algorithm (among those in Figure 7.2) that behaves best in each plot.

In Kmeans, both OUL and OUL-Steal scale when increasing the number of workers, while under high contention OUL-Steal performs better (Figure 7.8b). OWB and Ordered NOrec have similar performance, but OWB does not degrade at high thread counts.

Figure 7.8: Execution time of STAMP Kmeans (Y-axis log scale).

Genome exhibits little contention, which makes OUL and OUL-Steal perform similarly (Figure 7.9a). The amount of contention in SSCA2 is low, as the large number of graph nodes leads to infrequent concurrent updates. Figure 7.9b shows that all algorithms perform almost equally and benefit from optimistic concurrency. With Vacation, our cooperative model boosts the performance of the proposed algorithms, and they scale well when increasing the number of workers (clients) (Figures 7.9c and 7.9d).

Labyrinth is a multi-path maze solver. The maze is represented as a three-dimensional uniform grid, and each thread tries to connect input pairs by a path of adjacent maze points. Upon finding a path, it is highlighted on a shared output grid. Transactions conflict when their paths overlap. In Figure 7.9e, NOrec outdoes the other algorithms for two reasons: 1) as Labyrinth updates adjacent addresses along the path, it is prone to produce false sharing for all other algorithms that use locks; and 2) NOrec employs a value-based validation, thus when two conflicting transactions update a maze point with the same value, they commit successfully.

Intruder is a network intrusion detection system that uses signatures; it compares the captured packets against a dictionary of intrusion signatures. Packets are processed in parallel, grouped in sessions, and stored in a self-balancing (red-black) tree. Transactions guard the tree operations, and the contention is high and depends on the frequency of the rebalance operation. Figure 7.9f shows that not all algorithms scale well; moreover, the sequential execution outperforms all of them (except the unordered one).

[Plots: execution time (seconds, log scale) vs. threads (1-20). Panels: (a) Genome, (b) SSCA2, (c) Vacation Low, (d) Vacation High, (e) Labyrinth, (f) Intruder. Legend: OWB, OUL, OUL-Steal, Ordered TL2, Ordered NOrec, Ordered UndoLog-visible, Ordered UndoLog-invisible, STMLite, Best Unordered, Sequential.]

Figure 7.9: Execution time of STAMP applications (Y-axis log scale).

(a) OWB Algorithm:
    Begin:     read the last-committed transaction.
    Read X:    search in the write-set; or read from memory if unlocked;
               or read the speculative value and add to the dependency list;
               or abort the higher-age speculative write and read from memory.
               If last-committed has changed, validate the read-set.
    Write X:   buffer the write in the write-set.
    TryCommit: validate the read-set; acquire locks on the write-set;
               expose the values in the write-buffer.
    Commit:    validate the read-set; release the locks.
    Metadata:  read-set, write-buffer, dependency list.

(b) OUL Algorithm:
    Begin:     set the Tx state as active.
    Read X:    read from memory if unlocked; or acquire the read lock and
               read the speculative locked value; or abort the higher-age
               speculative write and read from memory.
    Write X:   acquire the write lock; back up the value to the undo-log;
               write the value.
    TryCommit: set the Tx state as pending; validate the written locations
               of the read-set.
    Commit:    set the Tx state as inactive.
    Metadata:  undo-log, RW-lock.

Figure 7.10: OWB and OUL Algorithms Summary

7.5 Discussion

In the past two chapters, we presented the OWB, OUL, and OUL-Steal algorithms, which effectively address the problem of committing transactions in an order defined prior to execution. Our results show that even if a system requires a specific commit order, it is possible to achieve high performance by exploiting parallelism even in the presence of data conflicts. Figure 7.10 summarizes the high-level methods of the two main algorithms, OWB and OUL.

Chapter 8

Extending TM Primitives using Low Level Semantics

In order for a TM implementation to be generic, conflicts are usually detected at the level of memory addresses. For this reason, the TM abstraction can be expressed using four instructions: TM BEGIN, TM END, TM READ, and TM WRITE. The first two identify the transaction boundaries, while the last two define the barriers for every memory read and write that occurs within those boundaries. TM algorithms differ in the way those instructions are implemented. Although frameworks may add other features, such as allowing external aborts, non-transactional reads/writes, or irrevocable operations, the above four instructions form the body of most TM solutions. Despite TM's high programmability and generality, its performance is still not as good as (or better than) that of optimized manual implementations of synchronization. To overcome that, researchers have investigated various approaches with different design choices. For STM, they mostly varied the internal granularity of locking and/or validation of the accessed memory addresses. Examples of those solutions include coarse-grained mutual exclusion of commit phases, as used in NOrec [53]; compact bloom filters [25] to track memory accesses, as used in RingSTM [174]; and fine-grained ownership records, as used in TL2 [59]. On the other hand, current HTM processors [149, 35] are "best effort" in nature, because transactions are not guaranteed to progress in HTM (even if they are executed alone, without any actual concurrency). That is why an efficient software "fallback" path is needed (i.e., hybrid TM) when hardware transactions repeatedly fail [61]. Recent literature proposes many compelling solutions that make the fallback path fast under different conditions [52, 152, 36, 61, 119]. The key commonality of all the aforementioned approaches is that they do not challenge the main objective of TM itself, which is providing generality at the application level. This is also the reason why those smart and advanced solutions still retain some of the fundamental inefficiency of TM. On the other hand, providing high performance in multi-threaded applications before the advent of TM, when thread synchronization was done manually using


fine-grained locks and/or lock-free designs, depended upon the specific application semantics. For example, identifying the critical sections and the best number of locks to use are design choices that can be made only after deeply understanding the semantics of the application itself (i.e., what the application does). A related question that arises in this regard is: Is there room for including semantics in TM frameworks without sacrificing their generality? If the answer is "yes", which is what we claim and assess in this chapter, then we will finally be able to overcome one of the main obstacles that has accompanied TM since its early stages, and boost its performance accordingly. Recent literature provides a few semantic-based concurrency controls, which are detailed in Section 8.1. However, they either solve specific application patterns [156], break the high abstraction of TM [92, 70], or are orthogonal to TM [90, 85]. Motivated by the above question, this chapter provides three major contributions. First, we identify a set of semantics that can be included in TM frameworks without impacting the generality of the TM abstraction (we call them TM-friendly semantics), and we extend the existing TM API to include such semantics. Second, we show how to modify STM algorithms to exploit such semantic-based APIs. Finally, we illustrate how we embedded those extensions in compiler passes (using GCC) so that the application development experience is not altered. Regarding the first point, by TM-friendly semantics we mean those optimizations that can be decoupled from the application layer. In particular, this work focuses on optimizing conditional statements (e.g., if x > 0) and increments/decrements (e.g., x++), which are commonly used in legacy applications. More details about those semantics are presented in Section 8.2. The second contribution involves deploying those TM-friendly semantics within existing state-of-the-art STM algorithms. Roughly, STM algorithms can be classified into two groups according to the technique used for validating transactions. The first group uses version-based validation, where each memory location keeps a version number that is used to identify memory changes. The second group uses value-based validation, where the content of each location itself is leveraged to detect memory modifications. For value-based algorithms, we propose semantic validation as a generalization of value-based validation, allowing TM frameworks to define a specific validator for the semantic-based instructions. For version-based approaches, we propose a methodology for adapting them to allow a hybrid (i.e., version/semantic) validation mechanism. In Section 8.3, we show how to modify NOrec [53] (a value-based algorithm) and TL2 [59] (a version-based algorithm) to include semantics. Then, in Section 8.4, we discuss the correctness of those new algorithms. Our last contribution is to integrate the semantic APIs and their corresponding STM algorithms into current TM frameworks. We propose two approaches to achieve that:

• The first approach is to implement the semantic extensions entirely as a compiler pass, thus not exposing any API additions to the programmer. This approach has the advantages of being entirely transparent and retaining backward compatibility with existing applications that leverage GCC's transactional API.

• The second approach involves exposing the new semantic APIs as TM interfaces. These new APIs give conscious programmers an opportunity to better exploit semantics while developing concurrent applications. Clearly, this approach increases the chance of achieving higher performance, at the cost of marginally reducing the programmability level.

Since each of those two solutions fits specific interests, we assess both of them in this part. We assess the latter solution (which is easier to implement) by enriching the API of the RSTM [116] framework with our semantics. Regarding the former, we show in Section 8.5 how we modified the compilation passes of GCC to provide full compilation support with limited overhead in terms of both the compilation process and execution time. In Section 8.6, we evaluate our semantic-based TM (using both RSTM and GCC) with the following applications: Bank, a benchmark that simulates a multithreaded application where threads mostly perform money transfers; LRU-Cache, a benchmark that simulates a software cache with a least-recently-used replacement policy; a hash-table benchmark; and the STAMP benchmark suite [124]. The results show that enabling semantics boosts performance consistently, yielding a peak improvement of 4× when semantics is highly exploited. Also, contrasting the performance trend of the GCC experiments with that of the RSTM experiments allows understanding the consequences of moving the whole TM framework, including our semantic extensions, to the compiler level. All the implementations used in this chapter, including the new versions of GCC and RSTM, are available as open-source projects at: https://bitbucket.org/mohamed-m-saad/extm.

8.1 The Evolution of Semantic TM

Not surprisingly, attempts to include semantics in TM started in the literature as early as TM itself. In fact, the original objective of the first TM proposal, as can easily be inferred from the title of the first TM paper [93], was providing architectural support for lock-free data structures. However, the approach proposed in that paper, as well as subsequent approaches, was fairly general because its main objective was improving programmability. As a result, the performance of TM could not compete with handcrafted (i.e., highly optimized) fine-grained and lock-free designs. In the last decade, involving semantics to improve TM performance has been an important topic, addressed by approaches such as open nested transactions [136], elastic transactions [70], specialized STM [64], and early release [92]. The main downside of all those attempts is that they move the entire burden of providing optimizations to the programmer,

and propose a modified framework to accept those programmer modifications. Since TM was mainly proposed to make concurrency control as transparent as possible from the programmer's standpoint, the practical adoption of the above approaches remained limited. The innovations presented in this chapter overcome those issues by providing solutions that preserve the generality of TM, do not give up optimizations and semantics, and cope with the current state-of-the-art TM implementations. Another research direction focused on developing collections of transactional blocks (essentially data structure operations) that perform better than the corresponding "naive" TM-based counterparts (i.e., when the sequential specification of a data structure is made concurrent using TM). Methodologies like transactional boosting [90, 85], consistency oblivious programming [8, 14], semantic locking [72], and partitioned transactions [188] are examples of that direction. Despite the promising results, those approaches remain isolated from TM as a synchronization abstraction and appear as standalone components designed mainly for data structures. Involving compilers in TM's concurrency control is currently becoming mandatory given the enhanced GCC release [180], which includes TM support. However, to the best of our knowledge, very few works have addressed the issue of detecting TM-friendly semantics at compilation time in a way similar to what we propose in this chapter. Among them, one recent approach proposes a new read-modify-write instruction to handle some programming patterns in TM [156]. However, that approach still addresses specific execution patterns and does not generalize the problem as our attempt does; our work pushes further in the direction of abstracting the problem and providing a comprehensive solution for injecting semantics into existing TM frameworks.

8.2 TM-Friendly API

In this section we present the proposed semantics that can be injected into TM frameworks without hampering the generality of TM itself. As mentioned before, TM defines two language/library constructs for reading (TM READ) and writing (TM WRITE) memory addresses. In most cases, these constructs enforce a "conservative" conflict resolution policy: two concurrent transactions are said to conflict if they access the same address and at least one access is a write. Algorithm 9 gives an example that shows why such a policy may be too conservative due to the lack of semantics.

In this example, when T1 executes its first line, existing TM algorithms save x and y in the read-set. From this point on, in order to preserve consistency, most TM implementations force T1 to abort as soon as any concurrent change to x or y occurs. This abort can be triggered during the validation of T1's next read (e.g., in NOrec), when T1 tries to commit (e.g., in TL2), or immediately (e.g., in Intel HTM processors). In this specific example, since T2 writes to x and y and commits before T1 reaches its commit phase, most TM

Algorithm 9 Two transactions conflicting at the memory level but not at the semantic level. Initially x = y = 5

TM BEGIN(T1)
if x > 0 || y > 0 then
  // Do reads/writes ...
                            TM BEGIN(T2)
                            x++
                            y--
                            TM END
end if
TM END

implementations force T1 to abort. However, T1 has no real issue at the semantic level and can safely commit, since the boolean result of the conditional expression still holds; the conflict triggered by the TM framework is a "false conflict" at the semantic level.
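To make the contrast concrete, the following C-style sketch shows the two possible instrumentations of T1's condition; the macro spellings and variable names are illustrative only, not the exact RSTM or GCC layer:

    /* (a) Conservative read barriers: x and y enter the read-set, so any
       committed concurrent write to either variable aborts T1. */
    bool cond_a = (TM_READ(x) > 0) || (TM_READ(y) > 0);

    /* (b) Semantic barriers: only the boolean outcome of each comparison is
       validated, so T2's x++/y-- commit leaves T1 valid as long as both
       variables remain positive. */
    bool cond_b = TM_GT(&x, 0) || TM_GT(&y, 0);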

TM GT(address, value|address)    greater than
TM GTE(address, value|address)   greater than or equal
TM LT(address, value|address)    less than
TM LTE(address, value|address)   less than or equal
TM EQ(address, value|address)    equals
TM NEQ(address, value|address)   not equals
TM INC(address, value)           increment
TM DEC(address, value)           decrement

Table 8.1: Extended TM Constructs.

Examples like the above motivated us to design extensions to the traditional transactional constructs that enrich the TM programming model. Those constructs are classified according to their semantics into two categories (summarized in Table 8.1). The first category includes conditional operators, which take two operands and return the boolean state of the conditional expression. The operands in this category can be two addresses, or an address and a value. At the memory level, a traditional execution of those constructs inside a transaction implies one or two calls to TM READ (depending on the type of the operands). Using our constructs, we consider the whole expression as one semantic operation, and the safety of the enclosing transaction is preserved by validating that the return value of the condition remains the same until the transaction commits. The second category includes increment/decrement operations, which take an address and an offset as arguments. Unlike the first category, the traditional way of handling a transactional increment/decrement involves both TM READ and TM WRITE. In our solution, leveraging semantics means invoking one semantic operation that performs the actual read only at commit time, which allows for more concurrency. Including those semantic operations in TM frameworks is appealing for two reasons. First, they are commonly used in applications, as we show later with some examples. Second, the

[Figure: a hash function H(x) probing a row of table cells; each cell carries a state (full, deleted, or free) and a stored key (e.g., Y, W, X, A, D, C, Z).]

Figure 8.1: Probing a hash table with open addressing

integration can be done entirely at compilation time, where the compiler can detect semantic operators and translate them. An interesting feature of the semantic operations listed in Table 8.1 is that they can compose, by having more than one operator and/or more than one variable in the conditional expression. For example, the scenario shown in Algorithm 9 can be further enhanced if the whole conditional expression (i.e., TM READ(x) > 0 || TM READ(y) > 0) is considered as one semantic read operation. In this example, if the condition was initially true and a concurrent transaction then modifies only one variable, either x or y, to be negative, considering the clause as a whole avoids aborting T1, given the OR operator. A similar enhancement consists of allowing complex expressions in conditional statements (e.g., x + y > 0), where modifications of multiple variables may compensate each other so that the return value of the overall expression remains unchanged. Although supporting such complex expressions is appealing because it may save additional aborts, integrating them into algorithm designs and GCC may add overhead in terms of both the compilation process and execution time. For that reason, we currently do not support those complex expressions, and we plan further investigation of them. A more detailed discussion of those operations is in Section 10.2.2.

8.2.1 TM-friendly semantics in action

Algorithm 10 Using the semantic constructs to enhance hash table probing.
TM BEGIN
  while TM READ(states[index]) != FREE &&
        (TM READ(states[index]) == REMOVED || TM READ(set[index]) != value) do
      ▷ Using our constructs: while (TM NEQ(states[index], FREE) &&
      ▷   (TM EQ(states[index], REMOVED) || TM NEQ(set[index], value)))
    index = (index + probe)
  end while
  return TM READ(states[index]) == FREE ? -1 : index
TM END

To further support the need for injecting semantics into the classical TM abstraction, we now show examples from real benchmarks and applications whose performance can be enhanced by our semantic TM extensions. These examples are clearly not exhaustive, but they are representative of programming patterns used in concurrent programming. Hashtable with open addressing. Operations in such a hash table usually start by probing the table in order to find a matching index for a given hash value. Figure 8.1 depicts an example

of this probing. This function can be enhanced by our approach because it consists of a chain of conditional expressions that check specific semantics and do not require particular values of states or set (e.g., it may only require the checked cells to be not free, and either flagged as removed or holding a different value from the hashed one). On the other hand, when using the classical read/write TM constructs, concurrent changes to the accessed cells (e.g., deleting 'D' and inserting 'B') will abort the probing transaction. Considering semantics through our proposed extensions avoids such aborts. Algorithm 10 depicts the pseudocode of the probing method and its transformed semantic version.

Algorithm 11 Using the semantic constructs to enhance the dequeue operation.
TM BEGIN
  if TM READ(head) == TM READ(tail) then   ▷ Using our constructs: if (TM EQ(head, tail))
    return false
  end if
  item = array[TM READ(head) % array size]
  TM WRITE(head, TM READ(head) + 1)        ▷ Using our constructs: TM INC(head, 1)
  return true
TM END

Queues. Any efficient concurrent queue implementation should let an enqueue operation execute concurrently with a dequeue operation when the queue is not empty. However, this is not allowed using traditional TM constructs, because the dequeue operation compares the head with the tail in order to detect the special case of an empty queue. Algorithm 11 shows how we re-enable this level of concurrency in an array-based queue using our constructs.

Algorithm 12 Using the semantic constructs to enhance reservations in the Vacation benchmark.
TM BEGIN
  for n = 0; n < ids.length; n++ do
    res = tablePtr.find(ids[n])
    if TM READ(res.numFree) > 0 then           ▷ Using our constructs: TM GT(res.numFree, 0)
      if TM READ(res.price) > max price then   ▷ Using our constructs: TM GT(res.price, max price)
        max price = TM READ(res.price)
        max id = id
      end if
    end if
  end for
  reservation = tablePtr.find(max id)
  TM WRITE(res.numFree, TM READ(res.numFree) - 1)  ▷ Using our constructs: TM INC(res.numFree, -1)
TM END

Vacation. This application is included in the STAMP suite [124] and simulates a travel reservation system. The workload consists of clients' reservations; each client uses a coarse-grained transaction to execute its session. Vacation has two main operation profiles: making a reservation and updating offers (e.g., price changes). Although the reservation profile checks the common attributes of the offer (e.g., the number of free slots and the range of prices), most of those checks are semantic and do not seek specific values. Using the classical (more conservative) TM model, any update to offers will conflict with all concurrent reservations because of those conditional statements. Using our semantic extensions, as depicted in Algorithm 12, the reservation will not abort as long as the outcomes of the comparison conditions hold (e.g., number of free slots > 0 and price > max price). The key idea, which also explains the intuition behind our proposal, is that a reservation does not use the exact value of the price or the amount of available resources; it just checks that the price is in the right range and that resources are still available.

Algorithm 13 Using the semantic constructs to enhance the Kmeans benchmark.
TM BEGIN
  TM WRITE(*new centers len[index], TM READ(*new centers len[index]) + 1)
      ▷ Using our constructs: TM INC(*new centers len[index], 1)
  for j = 0; j < nfeatures; j++ do
    TM WRITE(new centers[index][j], TM READ(new centers[index][j]) + feature[i][j])
        ▷ Using our constructs: TM INC(new centers[index][j], feature[i][j])
  end for
TM END

Kmeans. Kmeans is another STAMP application; it implements a clustering algorithm that iterates over a set of points and groups them into clusters. The main transactional overhead is in updating the cluster centers, which can be reduced using our TM INC operation, as shown in Algorithm 13.

8.3 Semantic-Based TM Algorithms

The first step towards injecting semantics into STM algorithms is to find an abstract way to define them. The semantic operations listed in Table 8.1 can be seen as the implementation of two abstract methods:

bool cmp(operator, address, val)
void inc(address, delta)

where cmp and inc represent the semantic actions that replace the normal TM behavior (delta can be positive or negative, to support both increment and decrement). In this abstraction, we restrict the cmp operations of Table 8.1 to those that take an address and a value as arguments. However, as we show in Section 8.5, our compilation pass also detects the address-address case and translates it to a specific API call. Extending the STM algorithms presented in this section to cover the address-address case is straightforward, so we omit it to simplify the presentation. In this section, we show how we integrate the above two abstract methods into two state-of-the-art STM algorithms: NOrec [53] and TL2 [59].
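For concreteness, the following is a minimal C++ sketch of this abstraction; the interface name, the Op enum, and the use of intptr_t for word-sized values are our illustrative choices rather than the actual RSTM/libitm definitions:

    #include <cstdint>

    enum class Op { EQ, NEQ, GT, GTE, LT, LTE };

    struct SemanticTM {
        // Evaluate (*addr op val) transactionally; the enclosing transaction
        // stays valid as long as the boolean outcome of the expression holds.
        virtual bool cmp(Op op, const intptr_t* addr, intptr_t val) = 0;

        // Add delta (positive or negative) to *addr; the read of the old
        // value is deferred to commit time, when the address is locked.
        virtual void inc(intptr_t* addr, intptr_t delta) = 0;

        virtual ~SemanticTM() = default;
    };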

8.3.1 Semantic NOrec Algorithm (S-NOrec)

NOrec is an STM algorithm that exploits value-based validation to eliminate the need for fine-grained locks. A transaction stores the values it reads as metadata in a local read-set, and validates this read-set before every read, as well as at commit time for writing transactions. The commit phase is protected by a single global timestamped lock. The validation procedure succeeds if all accessed addresses still hold the same values as those saved in the read-set. We extend NOrec to support our constructs as shown in Algorithm 14 (we call the new algorithm S-NOrec), mainly by executing cmp and inc using additional procedures. The main difference between read and cmp is that read appends the normal address/value pair to the read-set (line 47), while cmp saves the conditional expression (or its inverse, if the condition is false) in the read-set (line 39). To simplify the validate procedure, we consider read as a semantic TM EQ operation. Consequently, the validate procedure (lines 1-11) becomes a generalization of the original NOrec validation that uses a semantic check instead of the original value-based one. Both read and cmp read the address using a special readValid procedure (lines 12-19) that performs a read-set validation (if the global timestamp has changed) to ensure the consistency of the current state of the read-set. Supporting inc operations requires storing the delta (i.e., the incremented or decremented value) in the write-set and applying it at commit time. In practice, we support inc by overloading NOrec's write-set: a flag is added to each write-set entry to indicate whether it stores a standard write or an increment. The cases where a variable is read/written (either semantically or non-semantically) by two different operations in the same transaction are handled by S-NOrec as follows:

• write after write: If an inc is preceded by a write or an inc, the new delta is accumulated into the entry's value without changing the entry's flag (line 52). If a write is preceded by a write or an inc, it simply overwrites the value and changes the flag to indicate a write operation (line 58).
• read after write: Both compare and read check the write-set first for read-after-write conflicts (lines 35 and 44). If the write-set entry is an increment, the inc is promoted to a traditional read and write operation (see lines 22-24). The read part of the promotion is also done using the readValid procedure (line 22).
• write after read: This case is inherently covered because the value of the address will be validated anyway at commit time (because of the read) before the write takes place. It does not matter whether the read/write operations are semantic or non-semantic.
• read after read: We add two different entries to the read-set for each read. Although this approach looks redundant and may nullify the gain of adopting a semantic validation

Algorithm 14 S-NOrec

 1: procedure Validate(Transaction tx)
 2:   time = global lock
 3:   if (time & 1) != 0 then go to 2 end if
 4:   for each (addr, operation, val) in reads do
 5:     if !(addr OP val) then              ▷ Semantic validation
 6:       Abort()
 7:     end if
 8:   end for
 9:   if time != global lock then go to 2 end if
10:   return time
11: end procedure

12: procedure ReadValid(Address addr, Transaction tx)
13:   val = *addr
14:   while snapshot != global lock do
15:     snapshot = Validate(tx)
16:     val = *addr
17:   end while
18:   return val
19: end procedure

20: procedure RAW(Address addr, Transaction tx)
21:   if writes[addr].type = INCREMENT then
22:     val = ReadValid(addr, tx)           ▷ Promote increment
23:     reads.append(address, val, EQUALS)
24:     writes[addr] = ( entry.value + val, WRITE )
25:   end if
26:   return writes[addr].value
27: end procedure

28: procedure Start(Transaction tx)
29:   do
30:     snapshot = global lock
31:   while (snapshot & 1) != 0
32: end procedure

33: procedure Compare(Address addr, Operation op, Value operand, Transaction tx)
34:   if writes[addr] then
35:     return RAW(addr, tx) OP operand
36:   end if
37:   val = ReadValid(addr, tx)
38:   result = (val OP operand)
39:   reads.append(addr, operand, result ? OP : Inverse(OP))
40:   return result
41: end procedure

42: procedure Read(Address addr, Transaction tx)
43:   if writes[addr] then
44:     return RAW(addr, tx)
45:   end if
46:   val = ReadValid(addr, tx)
47:   reads.append(addr, val, EQUALS)
48:   return val
49: end procedure

50: procedure Increment(Address addr, Value delta, Transaction tx)
51:   if writes[addr] then
52:     writes[addr] = ( entry.value + delta, entry.type )
53:   else
54:     writes[addr] = ( delta, INCREMENT )
55:   end if
56: end procedure

57: procedure Write(Address addr, Value value, Transaction tx)
58:   writes[addr] = ( value, WRITE )
59: end procedure

if one read is semantic and the other is non-semantic, the overhead of discovering duplicates may not be negligible in the normal case.

S-NOrec is, to the best of our knowledge, the first STM algorithm that supports cmp operations. For inc operations, a recent approach discusses supporting a pattern similar to our proposal [156]. Interestingly, in contrast with [156], S-NOrec maintains the same privatization and publication properties [121] as the original NOrec algorithm, since it still uses the global timestamp at commit time. In fact, there is no considerable overhead of S-NOrec over NOrec in terms of either processing time or memory usage, as it only adds the read-set operation type and the write-set flag to the algorithm's metadata.
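The following C++ sketch summarizes the only metadata that S-NOrec adds on top of NOrec, together with the semantic validation loop of lines 1-11 of Algorithm 14; all type and field names are illustrative (the Op enum repeats the one from the earlier sketch for self-containment):

    #include <cstdint>
    #include <vector>

    enum class Op { EQ, NEQ, GT, GTE, LT, LTE };   // operator saved per read-set entry

    struct ReadEntry {                 // read -> op == Op::EQ (plain value-based check)
        intptr_t* addr;                // cmp  -> the operator and operand of the condition
        intptr_t  operand;
        Op        op;
    };

    struct WriteEntry {
        intptr_t value;                // WRITE: value to store; INCREMENT: accumulated delta
        enum Type { WRITE, INCREMENT } type;
    };

    static bool eval(Op op, intptr_t a, intptr_t b) {
        switch (op) {
            case Op::EQ:  return a == b;   case Op::NEQ: return a != b;
            case Op::GT:  return a >  b;   case Op::GTE: return a >= b;
            case Op::LT:  return a <  b;   default:      return a <= b;
        }
    }

    // Semantic validation: a generalization of NOrec's value-based loop.
    // A plain read fails only if the value changed (EQ); a cmp fails only
    // if the boolean outcome of its saved expression changed.
    static bool validate(const std::vector<ReadEntry>& reads) {
        for (const ReadEntry& e : reads)
            if (!eval(e.op, *e.addr, e.operand))
                return false;          // outcome changed -> abort
        return true;
    }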

8.3.2 Semantic Transactional Locking 2 Algorithm (S-TL2)

TL2 is an STM algorithm that maps shared memory locations to a table of ownership records (orecs). Writing transactions lock the orecs of their write-set entries at commit, instead of acquiring a global lock as in NOrec. Because of that, writing transactions can commit concurrently as long as they access different orecs; hence, TL2 is known to scale better than NOrec. To validate reads, TL2 leverages: i) a global timestamp, which is atomically incremented by each writing transaction at commit; ii) a start version for each transaction, which is set at the beginning of the transaction by snapshotting the global timestamp; and iii) an orec version for each orec, which is modified by the writing transaction at commit time. This way, validation is done simply by ensuring that the orec version of a newly read address is less than the start version of the transaction, and by revalidating the orec versions of the whole read-set at commit time (only if the transaction is a writing transaction). Algorithms 15 and 16 depict our extended version of TL2 (called S-TL2). The write-set handlers (inc, write and raw) are similar to Algorithm 14, so we do not show them in Algorithm 15. On the other hand, supporting the cmp operation in S-TL2 is more complex than in S-NOrec. The first issue is that the actual addresses and their values are not saved in the read-set; only the corresponding orecs are. To solve this problem, we define a separate compare-set for saving cmp operations, whose structure is similar to S-NOrec's read-set. In particular, a read operation saves the orec of the address in the read-set (line 59), and a cmp operation saves the actual address along with the information about the compare operation in the compare-set (lines 21 and 41). The second problem is that we now have two ways of validating reads: the first relies on value-based validation (for cmp operations), and the second relies on the relation between the read version of an orec and the start version of the enclosing transaction (for read operations). To address this issue efficiently, we split the execution into three phases. The first phase spans from the transaction begin until the first read operation. The second one spans from the first read until right before commit. The last phase is the commit

Algorithm 15 S-TL2

 1: procedure Start(Transaction tx)
 2:   tx.start version = global timestamp
 3: end procedure

 4: procedure Compare(Address addr, Operation op,
 5:                   Value operand, Transaction tx)
 6:   if writes[addr] then
 7:     return RAW(addr, tx)
 8:   end if
 9:   orec = getOrec(addr)
10:   L1 = orec.version
11:   if tx.reads.isEmpty() then            ▷ Phase 1: no reads yet
12:     if orec.lock ∉ {tx, φ} then
13:       go to 9                           ▷ Wait until unlocked
14:     end if
15:     val = *addr
16:     L2 = orec.version
17:     if L1 != L2 then
18:       go to 9                           ▷ Retry read
19:     end if
20:     result = (val OP operand)           ▷ Add to compare-set
21:     compares.append(addr, operand, result ? OP : Inv(OP))
22:     if L1 > start version then
23:       time = global timestamp
24:       ValidateCompareSet()
25:       if time != global timestamp then
26:         go to 23                        ▷ Retry validation
27:       else
28:         start version = time            ▷ Extend start version
29:       end if
30:     end if
31:   else                                  ▷ Phase 2: at least one previous read occurred
32:     if orec.lock ∉ {tx, φ} then
33:       Abort()
34:     end if
35:     val = *addr
36:     L2 = orec.version
37:     if L1 > start version ∨ L1 != L2 then
38:       Abort()
39:     end if
40:     result = (val OP operand)           ▷ Add to compare-set
41:     compares.append(addr, operand, result ? OP : Inv(OP))
42:   end if
43:   return result
44: end procedure

45: procedure Read(Address addr, Transaction tx)
46:   if writes[addr] then
47:     return RAW(addr, tx)
48:   end if
49:   orec = getOrec(addr)
50:   L1 = orec.version
51:   if orec.lock ∉ {tx, φ} then
52:     Abort()
53:   end if
54:   val = *addr
55:   L2 = orec.version
56:   if L1 > start version ∨ L1 != L2 then
57:     Abort()
58:   end if
59:   reads.append(orec)                    ▷ Add to read-set
60:   return val
61: end procedure

Algorithm 16 S-TL2 - cont.

62: procedure ValidateReadSet(Transaction tx)
63:   for each (orec) in tx.reads do
64:     if orec.lock ∉ {tx, φ} ∨ orec.version > start version then
65:       Abort()
66:     end if
67:   end for
68: end procedure

69: procedure ValidateCompareSet(Transaction tx)
70:   for each (addr, operation, val) in tx.compares do
71:     current = *addr
72:     orec = getOrec(addr)
73:     if orec.version > start version then
74:       if orec.lock ∉ {tx, φ} then
75:         repeat until orec.lock = φ      ▷ Wait until unlocked
76:       end if
77:       if !(current OP val) then         ▷ Semantic validation
78:         Abort()
79:       end if
80:     end if
81:   end for
82: end procedure

83: procedure Commit(Transaction tx)
84:   AcquireWriteSetLocks(tx)
85:   time = global timestamp
86:   if start version != time then
87:     ValidateCompareSet(tx)
88:   end if
89:   if !CAS(global timestamp, time, time+1) then
90:     go to 85                            ▷ Retry compare-set validation
91:   end if
92:   if start version + 1 != time then
93:     ValidateReadSet(tx)
94:   end if
95:   WriteBack(tx, time + 1)
96:   ReleaseWriteSetLocks(tx)
97: end procedure

phase. In the first phase, before the first read operation, cmp operations can be optimized similarly to S-NOrec (lines 11-30): the transaction's start version is not used; rather, the compare-set is validated after each cmp operation (line 24). If this validation succeeds, the transaction's start version is extended (line 28). This way, we allow semantic validation as long as no read operation has been executed yet. Another, less important, optimization applies when the address's orec is observed to be locked by a concurrent transaction. In this case, the cmp operation waits until the orec is unlocked instead of aborting the transaction (line 75). In those cases, we employ a timeout mechanism (not shown in the algorithm) to avoid starvation. This optimization makes sense only for cmp operations: for read operations, observing a locked orec means that its orec version will likely be updated before it is unlocked, which also means that the read will be invalidated and the transaction aborted anyway. The same optimization is made when a concurrent transaction changes the orec version while the variable is being read (line 18), and also when the global timestamp is

changed during the compare-set validation (line 26). In the second phase, after the first read operation, cmp operations have to preserve consistency with previous reads, and therefore the transaction's start version cannot be extended anymore. That is why, in this phase, both read (lines 45-61) and cmp (lines 31-42) validate that the newly read address (even if read inside a cmp operation) is consistent with the previous reads, by comparing the orec version with the transaction's start version as the original TL2 does. The commit phase is depicted in lines 83-97. In TL2, the commit phase starts by locking the write-set's orecs and atomically incrementing the global timestamp; then the reads are re-validated. If validation succeeds, writes are published and the locks are released. The commit phase of S-TL2 differs from that of TL2 in two points: i) the way reads are validated; and ii) the way the global timestamp is incremented. Regarding the first point, the read-set and the compare-set are validated differently, using ValidateReadSet (lines 62-68) for the former and ValidateCompareSet (lines 69-82) for the latter. Specifically, when the read version of an orec is greater than the start version of the transaction, which means that the value of the address may have changed, ValidateReadSet aborts the transaction (line 65), while ValidateCompareSet re-computes the expression and aborts only if the return value has changed (line 78). Second, if a concurrent transaction starts its commit phase during ValidateCompareSet, the compare-set has to be re-validated. This is important because the return value of a cmp operation may be affected by this new commit, and thus the ValidateCompareSet procedure may return an incorrect result. Lines 85-90 depict how we achieve that (the order of the lines is important here): the global timestamp is snapshotted, ValidateCompareSet is called, and then the global timestamp is incremented using CAS instead of AtomicFetchAndAdd. If the CAS fails, validation is retried. It is worth noting that this mechanism is not needed for ValidateReadSet, because it conservatively aborts the transaction if any orec in the read-set has changed. S-TL2 requires adding a compare-set, as well as a flag in the write-set, on top of the original metadata of TL2. An additional source of overhead is that cmp operations may involve calling ValidateCompareSet, whose execution time is linear in the size of the compare-set itself. However, as we show in the evaluation, those overheads are mostly outweighed by the performance gain due to avoiding unnecessary aborts.
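The sketch below condenses this hybrid commit-time validation in C++; it reuses the ReadEntry and eval definitions from the S-NOrec sketch, and the Orec layout, orec_of, and abort_tx are hypothetical simplifications of Algorithms 15 and 16 (the lock-waiting of line 75 is omitted for brevity):

    #include <cstdint>
    #include <vector>

    struct Orec { uint64_t version; /* lock word omitted for brevity */ };

    // Hypothetical helpers: orec_of maps an address to its ownership record,
    // abort_tx rolls the transaction back.
    Orec* orec_of(const intptr_t* addr);
    [[noreturn]] void abort_tx();

    void hybrid_commit_validation(const std::vector<Orec*>& read_set,
                                  const std::vector<ReadEntry>& compare_set,
                                  uint64_t start_version) {
        // Read-set: conservative, version-based -- any newer orec aborts
        // (cf. lines 62-68).
        for (const Orec* o : read_set)
            if (o->version > start_version)
                abort_tx();

        // Compare-set: semantic -- a newer orec is tolerated as long as the
        // boolean outcome of the saved expression still holds (cf. lines 69-82).
        for (const ReadEntry& e : compare_set)
            if (orec_of(e.addr)->version > start_version &&
                !eval(e.op, *e.addr, e.operand))
                abort_tx();
    }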

8.4 Correctness

The correctness of a TM algorithm is usually inferred by proving that all histories it generates are opaque [76]. We infer the correctness of S-NOrec and S-TL2 in the same way. We start by roughly recalling some definitions related to opacity, borrowed from [76], to

make the presented intuitions self-contained. A history H is a sequence of operations issued by transactions on a set of shared objects. Intuitively, we say that a history H is sequential if no two transactions are concurrent. A sequential specification of a shared object ob, called Seq(ob), is the set of all sequences of operations on ob that are considered correct when executed sequentially. A sequential history is legal if, for every shared object ob, the subsequence of operations on ob in H is in Seq(ob). Two histories are equivalent if they contain the same transactions with the same operations and the same return values. Given a history H, Complete(H) indicates the set of histories obtained by committing or aborting every commit-pending transaction in H, and aborting every other live transaction in H. A history H is opaque if any history in Complete(H) is equivalent to a legal sequential history S that preserves the real-time order of H. The definition of opacity is general enough to be applied to shared objects with generic APIs, as long as every shared object has a well-defined sequential specification based on those APIs. However, as mentioned before, most TMs consider the read-write register abstraction with two APIs: a read operation, which takes no argument and returns the current state of the register, and a write operation, which takes a value v as an argument and always returns ok. This simple scheme implies a trivial sequential specification for each register x, as defined in [76]: the set of all sequences of read and write operations on x such that every read operation returns the value given as an argument to the latest preceding write operation (or the initial value if there is none). Although this simple scheme best fits shared memory models, it does not match the APIs of our semantic-based TM, because it does not distinguish between reads/writes that are made within a comparison/increment expression and all other reads/writes. That is why the first step towards proving the correctness of our algorithms is to define the new abstraction for our TM. Our TM algorithms (S-NOrec and S-TL2) implement a TM whose shared registers export four operations: read, which takes no argument and returns the current state of the register; write, which takes a value v as an argument and always returns ok; inc, which takes a value d as an argument and always returns ok; and cmp, which takes a value v cmp and an operator type Op, whose value is taken from the enum {==, !=, >, >=, <, <=}, as arguments and returns true or false. The sequential specification of a register x in our TM, Seq(x), is defined as follows: the set of all sequences of read, write, cmp, and inc operations on x such that in each of them:

• every read operation returns v + Σd, where v is the value given as an argument to the latest preceding write operation w, and Σd is the sum of the values given as arguments to every inc between that read operation and w;

• every cmp operation returns the boolean value of the expression (v Op v cmp), where v is the return value of the corresponding read operation (a short worked trace is given below).
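As a purely illustrative example, the following sequence on a register x (initially 0) belongs to Seq(x):

    write(4)   → ok
    inc(+2)    → ok
    inc(-1)    → ok
    read()     → 5        (v = 4 from write(4); Σd = +2 − 1 = +1)
    cmp(>, 3)  → true     (the corresponding read returns 5, and 5 > 3)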

Algorithm 17 gives an example that clarifies the importance of defining a new abstraction

Algorithm 17 A history that is opaque with our APIs. Initial values are 0

TM BEGIN(T1)
if x >= 0 then
                            TM BEGIN(T2)
                            x = 1
                            y = 1
                            TM END
  z = y
end if
TM END

for our TM. Using the original read-write register abstraction, the corresponding history has two possible equivalent sequential histories (T1 → T2 and T2 → T1), and neither of them is legal, because T1 returns an illegal value for y in the former and an illegal value for x in the latter. However, using our (corrected) abstraction, T2 → T1 is an equivalent legal sequential history, because x is read using cmp and the return value of this cmp is legal, which means that the history is opaque. Another interesting example for assessing the correctness of S-NOrec and S-TL2 is shown in Algorithm 18. The history of this example is not opaque even with the new APIs, because the value of x at the moment of executing the cmp operation was different from its value when the transaction read y.

Algorithm 18 A history that is not opaque with our APIs. Initial values are 0

TM BEGIN(T1)
z = y
                            TM BEGIN(T2)
                            x = 1
                            y = 1
                            TM END
if x >= 1 then
  z = 1
end if
TM END

Based on the two cases in Algorithms 17 and 18, it is easy to understand the idea behind proving the opacity of histories containing cmp operations: we must prove that i) the address read inside each cmp is consistent with all previous reads at the moment of computing the return value of the conditional expression, and ii) this return value does not change until the transaction commits (even if the value of the address becomes inconsistent). Proving that a history containing inc operations is opaque is easier. This is because the read part of inc can be deferred to the commit phase, where the address is locked, and thus the whole inc operation is considered a write operation during the transaction's execution. The only exception is when the address accessed by an inc operation is also accessed by another operation in the same transaction. In the following two sections, we show how those cases are covered by S-NOrec and S-TL2.

8.4.1 Correctness of S-NOrec

Based on the opacity definition, the correctness of S-NOrec can be inferred if we identify the legal sequential history S that is equivalent to a generic history H generated at any point of its execution (after completing H). Roughly, H may contain committed transactions (either read-only or writing) and live transactions (considering aborted transactions as live right before they trigger the abort call). S is identified, similarly to NOrec, as follows: committed writing transactions are serialized when they CAS the global timestamp at commit; and both read-only and live transactions are serialized when the validation of their last finished read/cmp succeeds (i.e., after the committed writing transaction that sets the global timestamp to the value returned at line 10). The history S remains legal in the presence of cmp operations for the following reasons: i) every address is initially read consistently using the readValid procedure in all read, cmp, and RAW calls; and ii) the semantic validation made after each read and during commit guarantees that the return value of each cmp remains the same. The existence of inc operations also does not affect the legality of S because: i) at commit time, inc is handled exactly like write, which is safe because the transaction has exclusive access to the address (in fact, commit phases in NOrec are executed serially); and ii) live transactions that execute inc along with other operations on the same address are always consistent, because the read operations (read and cmp) check the write-set first and promote the inc operation if needed, and the write operations (write and inc) properly override the write-set entry.

8.4.2 Correctness of S-TL2

The legal equivalent serialization of a history generated by S-TL2 is slightly different from TL2's. In fact, handling inc operations does not affect the serialization, for the same reasons mentioned for S-NOrec. However, we identify two differences that arise due to the cmp operation. First, committed writing transactions are serialized in TL2 when they atomically increment the timestamp. In S-TL2, this atomic increment is replaced with a CAS operation, which forms the new serialization point (line 89). Using CAS instead of AtomicFetchAndAdd is needed because it is not legal for a transaction to observe the writes of any transaction that increments the global timestamp after it. Note that the transaction is guaranteed to observe the writes of all transactions that increment the timestamp before it, because the orecs of the write-set entries are locked before incrementing the timestamp (line 84), and the transaction waits until those orecs are unlocked. Second, the serialization of read-only and live transactions depends on the phase they are executing. If the transaction is in the first phase, before any read operation, the serialization

ITM S2Rtype    address-address semantic read operation
ITM S1Rtype    address-value semantic read operation
ITM SWtype     semantic write operation

Table 8.2: Extended GCC ABI.

point is similar to S-NOrec's: when the validation during the last cmp operation succeeds (more specifically, when start version is advanced at line 28). That is because none of the writing transactions committed so far has changed the return value of any cmp operation. On the other hand, if the transaction is in the second phase, after the first read operation, it is serialized similarly to TL2, namely before all writing transactions that commit with a timestamp greater than its start version. That is legal because: i) all cmp operations in the first phase are consistent up to the current value of start version; and ii) all read and cmp operations in the second phase use this start version in validation.

8.5 Integration with GCC

The GNU Compiler Collection (GCC) has supported STM since version 4.7 and HTM since version 4.9. The integration added the transaction atomic construct to C/C++ [7, 112]. GCC translates statements within a transaction atomic block into the appropriate TM calls, which follow an Application Binary Interface (ABI) similar to the TM ABI proposed by Intel [5]. Those calls are handled according to the TM algorithm chosen by the programmer. The implementation of those TM algorithms is encapsulated in the libitm library1. The first, straightforward, step we made towards embedding our semantic interfaces was adding three semantic operations to libitm's ABI (see Table 8.2). The first two operations, ITM S2R and ITM S1R, handle the cmp operation in its address-address and address-value modes, while ITM SW handles the inc operation. Then, we deployed our S-NOrec algorithm as an additional TM algorithm in the libitm library, and implemented the new ABI operations as described in Section 8.3.1. Due to the lack of a TL2 implementation that matches the baseline we used to construct our S-TL2, we leave the integration of S-TL2 as future work. In the TM algorithms already existing in the libitm library, the new operations are implemented by delegating their execution to the classical read and write handlers. The next, more complicated, step is to detect the code patterns of our semantic operations (cmp and inc) during compilation. We do that after GCC generates the GIMPLE [122] representation of the program. GIMPLE is a language-independent, tree-based representation that uses 3-operand expressions (except for function calls) in the Static Single Assignment

1We use GCC 5.3 and the libitm libraries from https://github.com/mfs409/transmem, which include NOrec.

(SSA) [50] form. We chose the GIMPLE representation for deploying our optimization passes for two reasons. First, GIMPLE is both architecture- and language-independent, so optimizing it is a transparent middle-end optimization. Second, GIMPLE uses temporary variables to put its expressions in a 3-operand form, where every variable is assigned only once. This form simplifies the dependency analysis. The tm mark pass is one of the optimization passes on the GIMPLE representation, in which statements that perform transactional memory operations are replaced with the appropriate TM built-ins. We extended this pass to detect the code patterns of cmp and inc operations as follows.

• cmp: For any conditional expression, we track the origins of its two operands along the GIMPLE tree. If one origin refers to a direct transactional memory access and the other refers to either a literal value or a local variable, then we replace the condition with a call to the ITM S1R built-in. If both origins refer to direct transactional memory accesses, we use ITM S2R.

• inc: For any transactional write, we track the origin of its right-hand side; if it is calculated using a "+" or "-" expression, we track both of its operands. If the origin of one of them is a transactional read of the same written address, and the origin of the second operand is either a literal value or a local variable, we emit a call to the ITM SW built-in instead of generating the transactional write (a sketch of the overall transformation is given below). At this stage, the original transactional read of the inc still exists and has to be removed. We do so in another optimization pass, named tm optimize, that removes TM read calls for never-live variables, since the read part of every inc becomes one of those never-live reads after the write part is replaced with our call. This pass is conservative; it does not remove a read unless it is guaranteed to be never-live.
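The sketch below illustrates the overall effect of the two rewrites on a small source fragment; the concrete built-in spellings (ITM S1R GT and ITM SW ADD) are hypothetical instantiations of the entries in Table 8.2, not necessarily the exact symbols emitted by our modified GCC:

    __transaction_atomic {
        if (x > 0)        /* cmp pattern: one transactional operand, one literal */
            x = x + 5;    /* inc pattern: a read and a write of the same address */
    }

    /* Conceptually becomes, at the GIMPLE level (simplified):
     *
     *   if (_ITM_S1R_GT(&x, 0))    // one semantic read replaces the plain TM read
     *       _ITM_SW_ADD(&x, 5);    // one semantic write replaces the read+write pair;
     *                              // tm_optimize then removes the dead TM read of x
     */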

A side optimization that our tm optimize pass performs is removing any TM read that is part of a never-live assignment, even if it is not originally part of an inc operation. The current GCC version does not perform any liveness optimization or dead-assignment identification on transactional code, and therefore does not remove such reads. Another important note is that the case in which a shared variable is involved in both semantic (i.e., cmp and inc) and non-semantic (i.e., read and write) operations of the same transaction is handled by the TM algorithm, as mentioned in Section 8.3 (see lines 35 and 52 of Algorithm 14, and line 7 of Algorithm 15); therefore, those cases need not be detected in the compiler passes. One advantage of our optimizations is that they reduce the number of TM calls for both ITM S2R and ITM SW from two to one. Such a reduction has a visible impact on application performance; in fact, TM calls are costly because GCC performs three indirect calls per TM call. Also, our pass does not require a complex alias analysis for tracking the operands' origins, because we look for simple expression patterns that usually reside in the same basic block.

            Hashtable         Bank              LRU
            base   semantic   base   semantic   base   semantic
Read        3440   0          22.5   0.05       173    12
Write       6.2    6.2        12.7   0          19.7   19.7
Compare     -      3440       -      10         -      161
Increment   -      0          -      12.7       -      0.03
Promote     -      0          -      0.05       -      0.01

            Vacation          Kmeans       Labyrinth    Yada         SSCA2      Genome       Intruder
            base    semantic  base  sem.   base  sem.   base  sem.   base sem.  base  sem.   base  sem.
Read        14704   13714     25    0      176   4      142   135    2    1     84    84     28.5  28.5
Write       28.5    12        25    0      173   173    21.4  21.4   2    1     3     3      2.6   2.6
Compare     -       989.5     -     0      -     172    -     7      -    0     -     0.06   -     0
Increment   -       16.7      -     25     -     0      -     0      -    1     -     0.01   -     0
Promote     -       15.7      -     0      -     0      -     0      -    0     -     0      -     0

Table 8.3: Average Number of Operations per Transaction.

8.6 Evaluation

We tested our extended semantic-based TM on a set of micro-benchmarks, as well as on applications of the STAMP suite [124]. We conducted our experiments on an AMD machine equipped with 2 Opteron 6168 CPUs, each with 12 cores running at 1.9 GHz. The total available memory is 12 GB. We report the throughput for the micro-benchmarks and the application execution time for STAMP, varying the number of threads executing concurrently, for both the RSTM and GCC implementations. Table 8.3 shows the average number of invocations per operation type in the benchmarks used. They are measured at runtime using RSTM, because it provides more flexibility for extracting statistics than GCC. In this table, the semantic and non-semantic versions are contrasted to give an intuition of the number of read and write operations saved by applying our semantic constructs. The reduction is substantial, which enables the performance improvements shown later.

8.6.1 RSTM-based implementations

In the following experiments, throughput and abort rate were computed for NOrec and TL2 in both their semantic and original (i.e., non-semantic) versions.

Micro Benchmarks

In our first set of experiments we considered three micro-benchmarks: Hashtable with Open Addressing, Bank, and Least Recently Used (LRU) Cache.
Hashtable with Open Addressing. The workload in this experiment is a collection of set and get operations, where each transaction performs 10 set/get operations. Both S-NOrec and S-TL2 exploit our semantic extensions in the probing procedure, as depicted in Algorithm 10. As a result, and as shown in Table 8.3, all read operations were transformed into semantic cmp operations. This reduced the number of aborts by one order of magnitude (Figure 8.2b), which directly raised the throughput (3.5× speedup) in both algorithms (Figure 8.2a).
Bank. Each transaction performs multiple transfers (at most 10) between accounts, with an overdraft check (i.e., the transfer is skipped if the account balance is insufficient). In the semantic version of the benchmark, the reads/writes are transformed into cmp and inc operations. As shown in Figure 8.2c, exploiting semantics helps S-NOrec outperform NOrec at low contention (1-8 threads). However, when contention increases, both NOrec and S-NOrec degrade and perform similarly. This is mainly because the probability of true conflicts increases, and transactions start to abort even if they are semantically validated. In TL2, concurrent commits are allowed; thus it scales better than NOrec. Similarly, S-TL2 benefits from the underlying semantics, performing 20% better than TL2 and incurring 2.5× fewer aborts.
LRU Cache. This benchmark simulates an m × n cache with a least-recently-used replacement policy. The cache uses m cache lines, and each line contains n buckets. Each bucket stores both the data and the hit frequency. Each transaction either sets or looks up multiple entries in the cache. Table 8.3 shows that 93% of the read operations were transformed into cmp operations. Accordingly, as shown in Figures 8.2e and 8.2f, S-NOrec reduces the aborts by two orders of magnitude and achieves up to 2× speedup. S-TL2 does not improve as much (only 25% speedup). The reason is that the non-transformed reads in S-TL2 prevent it from advancing its snapshot (i.e., they shorten the first phase described in Section 8.3.2); thus, any compare operation has to preserve the snapshot identified by the start version, and the overall behavior becomes similar to TL2.

Figure 8.2: Micro Benchmarks using RSTM. (a) Hashtable-Throughput; (b) Hashtable-Aborts; (c) Bank-Throughput; (d) Bank-Aborts; (e) LRU Cache-Throughput; (f) LRU Cache-Aborts. Curves: TL2, S-TL2, NOrec, S-NOrec.

STAMP

STAMP is a suite of applications designed for evaluating in-memory concurrency control; for more details see Section 5.9.2. Figures 8.3 and 8.4 show both execution time and abort rate for some of the STAMP benchmarks. We do not show the results of three applications (Genome, Intruder, and SSCA2) because we found that the semantic operations per transaction were very limited (see Table 8.3), and hence there was no difference in either abort rate or throughput. We also excluded Bayes because of its nondeterministic behavior. Since the performance saturated at a high number of threads in all tested applications, we show the results only up to 12 threads.

Kmeans. As illustrated in Algorithm 13, updating the centroid is changed by transforming all writes into increments. S-NOrec and S-TL2 achieve 25%-40% speedup (Figure 8.3a). However, at a high number of threads both NOrec and S-NOrec saturate and start to degrade in performance, which indicates a high-contention workload due to the coarse-grained locking. Consequently, starting from 8 threads, S-NOrec performs slightly worse than NOrec (see Figure 8.3a), because it adds an overhead that is not exploited to reduce the abort rate (see Figure 8.3b).

Vacation. The reservation procedure was optimized as in Algorithm 12; however, only 7% of the reads were transformed into compares. This is because most of the read operations are part of the internal red-black tree operations. Additionally, almost all the inc operations were promoted to read and write operations because of an additional sanity check performed by the transaction. Although these two factors limited the gain of using the benchmark semantics, both S-NOrec and S-TL2 consistently outperformed the original algorithms.

Labyrinth. Different checks along the routing path (e.g., isEmpty, isGarbage) were transformed into semantic cmp operations, which allowed S-TL2 to outperform TL2 by 20%-50% and to reduce the aborts by 2× (see Figures 8.3e and 8.3f). S-NOrec and NOrec perform similarly, which indicates that transactions that fail NOrec's value-based validation also fail S-NOrec's semantic validation. In [155], an optimized version of Labyrinth was proposed, where some non-transactional operations (memory copy) are moved outside the transaction, which in effect reduces the transaction size. Figures 8.4a and 8.4b show the performance of this new version. Although S-TL2 still has a lower abort rate, the gain in performance became insignificant because most of the work is now performed outside transactions.

Yada is a mesh triangulation benchmark implementing Ruppert's algorithm. Threads iterate over the mesh and try to produce a smoother one by identifying triangles whose minimum angle is below some threshold. NOrec's behavior is similar to Labyrinth. Interestingly, although S-TL2 reduced the number of aborts by 3×, throughput was not affected (Figures 8.4c and 8.4d). Our measurements revealed that the reason for this behavior lies in the aborted transactions. Although resolving the semantic conflicts of transactions in S-TL2 allowed them to proceed with execution, true conflicts caused most of them to abort later. Therefore, the aborted transactions in S-TL2 became longer than in TL2 without real benefit (since those transactions eventually aborted). This is similar to what happened in Bank.
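The write-to-increment transformation applied to Kmeans can be sketched as follows in C++; TM_READ, TM_WRITE, and TM_INC are illustrative stand-ins for the instrumented accesses and the semantic increment, not the exact API.

#include <cstddef>

// Illustrative stand-ins for the TM library's instrumented accesses.
#define TM_READ(x)     (x)
#define TM_WRITE(x, v) ((x) = (v))
#define TM_INC(x, d)   ((x) += (d))   /* semantic inc: commutative update */

// Read-modify-write centroid update: each coordinate enters both the
// read-set and the write-set, so concurrent updates always conflict.
void addPointRMW(double* centroid, const double* pt, size_t dim) {
    for (size_t j = 0; j < dim; ++j)
        TM_WRITE(centroid[j], TM_READ(centroid[j]) + pt[j]);
}

// Semantic version: the increment commutes, so concurrent transactions
// adding points to the same centroid need not abort each other -- the
// source of the 25%-40% speedup reported above.
void addPointInc(double* centroid, const double* pt, size_t dim) {
    for (size_t j = 0; j < dim; ++j)
        TM_INC(centroid[j], pt[j]);
}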

Figure 8.3: STAMP Applications using RSTM. (a) Kmeans-Execution Time; (b) Kmeans-Aborts; (c) Vacation-Execution Time; (d) Vacation-Aborts; (e) Labyrinth v1-Execution Time; (f) Labyrinth v1-Aborts. Curves: TL2, S-TL2, NOrec, S-NOrec.

Figure 8.4: STAMP Applications using RSTM. (a) Labyrinth v2-Execution Time; (b) Labyrinth v2-Aborts; (c) Yada-Execution Time; (d) Yada-Aborts. Curves: TL2, S-TL2, NOrec, S-NOrec.

8.6.2 GCC-based Implementations

Now we discuss the results of the above experiments using our modified GCC instead of RSTM. As mentioned in Section 8.5, in these experiments we focus on NOrec and S-NOrec. We also added one more version of NOrec that uses our modified GCC APIs but does not handle them semantically. The only difference between this version and NOrec is that the former calls our semantic APIs, which internally delegate to the normal reads/writes of NOrec, while the latter calls NOrec's APIs directly.

The results (Figures 8.5 and 8.6) follow the same trends described above for RSTM (we excluded Labyrinth and Yada from the GCC figures because NOrec and S-NOrec behave similarly in both of them, as shown in Figures 8.3 and 8.4). As before, our semantic extensions help improve performance in all benchmarks. Compared to RSTM, the actual throughput values decreased. This is mainly because GCC speculates every read and write within the transaction atomic blocks, while RSTM speculates only addresses accessed using its transactional TM_READ and TM_WRITE APIs. However, the GCC algorithms scale better than the RSTM algorithms, mainly because of internal optimizations in GCC, such as using more efficient structures to store and handle metadata. Interestingly, using our modified GCC even without exploiting semantics ("NOrec Modified-GCC"), we observe some performance improvement due to the decreased overall number of TM calls.
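For reference, GCC's transactional memory support (enabled with -fgnu-tm) instruments atomic blocks as sketched below; tm_transfer and NUM_ACCOUNTS are illustrative names, and the semantic entry points mentioned in the closing comment are our hypothetical additions, not stock libitm API.

// Compile with: g++ -fgnu-tm ...
constexpr int NUM_ACCOUNTS = 1024;
long balance[NUM_ACCOUNTS];

void tm_transfer(int from, int to, long amount) {
    __transaction_atomic {
        // Every load/store in this block is speculated by the runtime -- the
        // reason GCC throughput trails RSTM, which instruments only explicit
        // TM_READ/TM_WRITE calls.
        if (balance[from] >= amount) {
            balance[from] -= amount;
            balance[to]   += amount;
        }
    }
}
// In the "NOrec Modified-GCC" variant, the check and the updates are routed
// through our semantic APIs, which the modified runtime maps either to plain
// reads/writes (non-semantic) or to cmp/inc operations (semantic).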

Figure 8.5: Micro Benchmarks using GCC. (a) Hashtable-Throughput; (b) Hashtable-Aborts; (c) Bank-Throughput; (d) Bank-Aborts; (e) LRU Cache-Throughput; (f) LRU Cache-Aborts. Curves: NOrec Modified-GCC, NOrec, S-NOrec.

Figure 8.6: Some STAMP Applications using GCC. (a) Kmeans-Execution Time; (b) Kmeans-Aborts; (c) Vacation-Execution Time; (d) Vacation-Aborts. Curves: NOrec Modified-GCC, NOrec, S-NOrec.

Chapter 9

Exploiting Hardware Transactional Memory

Intel has recently introduced Haswell [149] as the first mainstream CPU with transactional memory support. Haswell supports best-effort hardware transactional memory (HTM) using Restricted Transactional Memory (RTM) [97]. RTM eliminates the overhead of transactional loads and stores, but it introduces hardware restrictions on the transaction size and on the overall progress. In this chapter, we propose novel techniques for executing transactions according to a predefined commitment order using Intel's RTM. First, we present two hardware algorithms: Burning Tickets and Timeline Flags. Next, we propose a hybrid hardware extension of our OWB algorithm (see Chapter 6), named OWB-RH.

9.1 Haswell’s RTM

Haswell introduces a TM interface with three new instructions: XBEGIN, XEND, and XABORT. XBEGIN and XEND mark the transaction start and end, while XABORT enables a transaction to abort itself. Transactional changes are stored in the processor cache, and at commit time the cache is flushed to main memory, which provides the illusion of atomicity. Consequently: 1) any cache request (read or write) from another processor aborts the current transaction; 2) conflict granularity is a cache line; and 3) a transaction is limited by the cache capacity and can abort due to capacity constraints. The conflict resolution policy implemented in Haswell is requestor wins, which means there is no guarantee of forward progress. A transaction is aborted when its underlying processor receives either a coherence message (read or write) for a cache line in its write-set, or a coherence write message for a cache line in its read-set. To overcome the absence of progress, a fallback software path is used after the transaction is repeatedly aborted. However, with the chance of having a transaction running in the fallback software path, it is required to


also coordinate conflicts between transactions executing in the software path and those executing in the hardware path [52, 36, 119].
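As background, the canonical RTM fast-path/slow-path pattern can be sketched in C++ using the RTM intrinsics (compile with -mrtm); run_transaction and the retry limit are illustrative choices, not a fixed interface.

#include <immintrin.h>
#include <atomic>

std::atomic<bool> sgl{false};   // single global lock guarding the slow-path

template <typename F>
void run_transaction(F body, int max_retries = 5) {
    for (int i = 0; i < max_retries; ++i) {
        unsigned status = _xbegin();          // XBEGIN
        if (status == _XBEGIN_STARTED) {
            if (sgl.load()) _xabort(0xff);    // subscribe to the lock: abort if
                                              // a slow-path transaction is live
            body();                           // speculative execution
            _xend();                          // XEND: atomic commit
            return;
        }
        // status encodes the abort cause (conflict, capacity, explicit, ...)
    }
    while (sgl.exchange(true)) { /* spin */ } // acquire the SGL
    body();                                   // irrevocable software slow-path
    sgl.store(false);
}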

9.1.1 The Challenges of ACO Support

Ordering transactions is challenging using RTM, because it requires information transfer between transactions for organizing the commitment order; under RTM, any information transfer between transactions within their context implies aborting one of them. To illustrate this, let us assume there is a shared variable that stores the next-to-commit transaction's age. This variable is checked when a transaction attempts to commit, and updated if the ACO matches its commit order. Let Ti be a transaction that is about to commit. If i = next-to-commit, then it commits successfully. However, when i ≠ next-to-commit, the address of the next-to-commit shared variable is now loaded in the processor cache and hence becomes part of the transaction's read-set. Eventually, the system updates next-to-commit whenever the transaction whose age equals next-to-commit commits. According to the requestor wins policy, transaction Ti (and any transaction that has read next-to-commit) has to be aborted. Thus, an algorithm that uses a shared variable such as next-to-commit to identify the commitment order must abort (and retry) the transaction immediately when its age is not equal to next-to-commit. Aborting a transaction may severely degrade performance because, after multiple aborts, the transaction may switch to the slow fallback software path; with some implementations (e.g., SGL [189]), this could stop the execution of all hardware transactions. Unfortunately, RTM does not support escape actions (i.e., executing non-transactional accesses within the transaction), and will not support them in the near future. Using escape actions, transactions would be able to organize the commitment without aborting each other. Figure 9.1 illustrates the worst and the best case scenarios for a next-to-commit based hardware algorithm. Recall that reading next-to-commit occurs only at the end of the transaction.
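The problematic pattern can be made explicit with a small fragment (illustrative names; assumed to run inside an RTM transaction started by the caller): the load of next_to_commit pulls its cache line into the read-set, so the predecessor's post-commit store aborts every waiting transaction under requestor wins.

#include <immintrin.h>
#include <atomic>

std::atomic<unsigned> next_to_commit{0};   // shared ordering variable

// Naive ACO enforcement -- shown only to illustrate the problem above.
bool try_ordered_commit(unsigned my_age) {
    if (next_to_commit.load() != my_age)   // the variable is now in our read-set
        _xabort(0x01);                     // cannot wait inside HTM: retry later
    _xend();                               // commit in order
    next_to_commit.store(my_age + 1);      // non-transactional store: this is
                                           // what aborts all other waiters
    return true;
}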

9.2 The Burning Tickets Hardware Algorithm (BTH)

We propose a new technique, burning tickets, that reduces the number of aborts due to the ACO. The Burning Tickets algorithm (BTH) uses pure hardware transactional memory, and uses time as a dimension for transferring the ordering information. In our technique, each transaction is associated with a list of tickets. A ticket holds a binary value (true or false), and the tickets are distributed over different cache lines. When the transaction completes its execution, it iterates over (burns) its tickets in order, with a predefined delay between accesses, checking whether one of them has a true value. Upon finding a ticket with a true value, the transaction stops iterating and commits. If all the tickets were inspected and none of them was true, the transaction aborts and restarts (using the same allocated tickets).

Figure 9.1: Executions of a Next-to-Commit Hardware Algorithm. (a) Worst Case Scenario; (b) Best Case Scenario.

The transaction with the absolute minimum age does not have tickets (or has tickets that are all set), so it commits as soon as it finishes its execution. After its commit, the transaction iterates in reverse order, with the same delay, over the tickets of its immediate successor transaction (i.e., the one with an age that is one higher than its own). This strategy is twofold. First, it allows a committed transaction to notify its successor to commit its values without aborting it (as long as this happens before the successor consumes all its tickets). Tickets that were set by the committed transaction do not affect the active transaction because they are not yet added to its read-set. On the other hand, the committed transaction is not affected by coherence read messages, because setting the tickets is done outside the transaction. Second, if a higher-age transaction conflicts with a lower-age one, it enters the ticket-burning phase, which delays its restart and reduces the chance of it being aborted again.
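A minimal C++ sketch of the two ticket walks follows; the array bounds, delay, and memory orderings are illustrative assumptions, and the busy-wait replaces a sleep because a context switch would abort the hardware transaction.

#include <atomic>

constexpr int N_TICKETS = 9;
constexpr int MAX_TX    = 64;    // assumed bound on concurrently tracked ages
struct alignas(64) Ticket { std::atomic<bool> set{false}; };  // own cache line
Ticket tickets[MAX_TX][N_TICKETS];

// Busy delay between ticket accesses (the d of Section 9.2.1).
static void spin(unsigned n) { for (volatile unsigned i = 0; i < n; ++i) {} }

// Runs *inside* the HTM transaction after the body finishes: burn tickets
// front-to-back. A ticket joins our read-set only once we touch it, so the
// predecessor, writing back-to-front, can set tickets we have not read yet
// without aborting us.
bool burn_and_check(unsigned age, unsigned delay) {
    for (int i = 0; i < N_TICKETS; ++i) {
        if (tickets[age % MAX_TX][i].set.load(std::memory_order_relaxed))
            return true;               // predecessor has committed: commit now
        spin(delay);
    }
    return false;                      // ran out of tickets: abort and retry
}

// Runs in the committed transaction *outside* any HTM context: set the
// successor's tickets back-to-front with the same inter-access delay.
void notify_successor(unsigned age, unsigned delay) {
    for (int i = N_TICKETS - 1; i >= 0; --i) {
        tickets[(age + 1) % MAX_TX][i].set.store(true, std::memory_order_relaxed);
        spin(delay);
    }
}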

9.2.1 Configuring Tickets Burning

In this section we analyze the proposed procedure to determine the best configuration (e.g., number of tickets, checking delay) that reduces the abort probability and the commit delay. Assume the number of tickets per transaction is N; the delay between ticket accesses (burning or setting) is d; the absolute clock time of committing the transaction with age i is t_i; and the delay introduced by the algorithm to guarantee in-order commitment is D_i.

Figure 9.2: Tickets Burning with number of tickets N = 9 and delay D = 3.

Figure 9.2 illustrates an execution with four transactions (T1-T4). The delay introduced for a transaction after it finishes execution is the sum of: 1) the waiting time until its correct commit time, and 2) the time required for the transaction to be notified by its predecessor, which we denote as D. In case Ti finishes before its predecessor Ti−1 (see Figure 9.2a), it will burn ⌈(t_{i−1} − t_i)/d⌉ tickets before the notification phase starts. The notification requires reaching a mid-way ticket between the two transactions (Ti and Ti−1), as they are moving in opposite directions. Therefore, we can represent D_i using the following relation.

D_i = (N − ⌈(t_{i−1} − t_i)/d⌉) / 2

Now, let us assume Ti finishes after Ti−1 (see Figure 9.2b). In this case, Ti−1 will set ⌈(t_i − t_{i−1})/d⌉ tickets before Ti starts burning them. Similarly, we can calculate the mid-way ticket in the same way. However, the commit time of the predecessor needs to also consider the delay introduced by the predecessor of the predecessor, which was itself waiting to commit. Therefore, we need to substitute t_{i−1} with t_{i−1} + D_{i−1}. We can thus rewrite the general relation between the delay and the number of tickets as the following formula.

D_i = (N − ⌈|t_{i−1} + D_{i−1} − t_i| / d⌉) / 2,  where i > 1, and D_1 = 0

Note that a negative delay indicates that either the transaction runs out of tickets and will be aborted (when t_{i−1} > t_i), or the predecessor transaction has set all the tickets and the next transaction will commit as soon as it finishes (t_{i−1} < t_i). In this way, a transaction can estimate the number of tickets based on earlier times and delays; hence, we can develop an adaptive model in which transactions control the aforementioned parameters to reduce aborts. Finally, a possible optimization is that transactions can maintain a next-to-commit variable outside the transaction boundaries. A transaction updates it after it commits and checks its value before it starts. If the transaction age matches next-to-commit, it can skip the ticket checking and commit directly.

9.2.2 Conflict Resolution

As mentioned earlier, the default conflict resolution in Haswell's RTM is the requestor wins policy. In effect, this does not preserve the age-based priority imposed by our model. Recall that transactions cannot share online information while they are executing. In order to resolve this situation, we assume that a transaction can define the set of written addresses before execution. Thereby, the transaction constructs a signature [25] of its modifications ahead of the execution. During execution, and before any access (read or write), a transaction must inspect the signatures of all higher-priority transactions (i.e., those with a lower age than the current one) that are actively running, and it aborts itself (using XABORT) if a match is found. To reduce the number of signature bits that are set, we can construct the signature in terms of cache lines instead of addresses. Two addresses that map to the same cache line can be considered a single element, because they have the same effect. This optimization can easily be implemented in the signature hash function itself.
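A possible signature realization is sketched below in C++; the bit width, the 64-byte line granularity, and the view of active writers are illustrative assumptions.

#include <bitset>
#include <cstdint>
#include <cstddef>

constexpr int SIG_BITS = 1024;
using Signature = std::bitset<SIG_BITS>;

// Hash on the cache line (address >> 6 for 64-byte lines): two addresses in
// the same line map to the same bit, matching HTM's per-line conflicts.
inline size_t sigHash(const void* addr) {
    return (reinterpret_cast<uintptr_t>(addr) >> 6) % SIG_BITS;
}
inline void sigAdd(Signature& s, const void* addr) { s.set(sigHash(addr)); }

// Before each access, inspect the write signatures of all actively running
// lower-age (higher-priority) transactions; the caller issues XABORT on a hit.
bool mustSelfAbort(const void* addr, unsigned myAge,
                   const Signature* writerSigs, const unsigned* writerAges,
                   unsigned nActive) {
    for (unsigned k = 0; k < nActive; ++k)
        if (writerAges[k] < myAge && writerSigs[k].test(sigHash(addr)))
            return true;
    return false;
}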

Figure 9.3: Timeline Flags with N = 19 and M = 3. The figure shows transactions T5, T6, and T7 reading flags, waiting, and committing across the timeline.

9.3 The Timeline Flags Hardware Algorithm (TFH)

A drawback of the BTH algorithm is its use of per-thread metadata (i.e., the tickets). This limits the scalability of the algorithm: with larger metadata, the abort probability due to cache capacity increases. In this section, we propose a new algorithm, Timeline Flags, that uses a constant amount of metadata, independent of the number of active transactions (i.e., processor cores). Similar to the BTH algorithm, we exploit time as a dimension for exchanging the required ordering information. Assume Θ is the length of the context-switch period. Hardware transactions cannot span processor context switches; hence, a transaction can only commit at some point of time within Θ. In TFH, we subdivide Θ into n sub-intervals of length τ (i.e., τ1, τ2, ..., τn), and represent them as a circular buffer B of size n. Any point of the absolute clock time (e.g., ct) can be mapped to an element of B (i.e., B[⌊(ct mod Θ)/τ⌋]). The elements of B, namely the timeline flags, are initially set to 0, where 0 is the smallest transaction age in our system.

Let transaction Ti (with age = i) be about to commit while preserving the ACO. In order to do that, Ti checks the current clock time (ct)¹ and maps ct to a timeline flag of B (i.e., B[⌊(ct mod Θ)/τ⌋]). If the value of the timeline flag equals i, the transaction commits; then it sets the value of the next M flags after ⌊(ct mod Θ)/τ⌋ to i + 1 (recall that B is a circular buffer). Alternatively, if the transaction finds that the value of the timeline flag ≠ i, it waits for τ time and performs the check again. However, as the clock time has advanced, the next check will inspect the next timeline flag.

¹The rdtsc and rdtscp instructions load the current value of the processor's timestamp counter, and are safe to call within hardware transactions.

Interestingly, at any point of time, only one transaction writes to a timeline flag, and that flag has never been read by any transaction. The reason is that the transaction updates the next M flags, which correspond to clock times in the future. Figure 9.3 illustrates an example of a TFH execution with 19 timeline flags; after a successful commit, a transaction sets the next M = 3 flags.
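A C++ sketch of the TFH commit step follows; the buffer size, window, τ in cycles, and the retry bound are illustrative configuration choices (the evaluation in Section 9.3.2 uses 32 flags, M = 10, and τ = 256 cycles), and the function is assumed to be called inside the hardware transaction.

#include <immintrin.h>
#include <x86intrin.h>   // __rdtsc
#include <atomic>
#include <cstdint>

constexpr unsigned N_FLAGS    = 32;    // circular buffer size (n)
constexpr unsigned M_WINDOW   = 10;    // flags set after a commit (M)
constexpr uint64_t TAU_CYCLES = 256;   // clock span of one flag (tau)
std::atomic<unsigned> timeline[N_FLAGS];

inline unsigned flagIndex(uint64_t ct) {          // ct -> timeline position
    return (ct / TAU_CYCLES) % N_FLAGS;
}

// Called inside the hardware transaction, right before committing. Each
// failed check waits ~tau; since the clock advanced, the next check reads
// the *next* flag, so the predecessor writes flags we have not read yet.
bool timelineCommit(unsigned myAge) {
    for (unsigned tries = 0; tries < N_FLAGS; ++tries) {
        if (timeline[flagIndex(__rdtsc())].load(std::memory_order_relaxed)
                == myAge) {
            _xend();                               // commit in ACO
            uint64_t now = __rdtsc();              // outside the HTM now:
            for (unsigned k = 1; k <= M_WINDOW; ++k)
                timeline[flagIndex(now + k * TAU_CYCLES)]
                    .store(myAge + 1, std::memory_order_relaxed);
            return true;
        }
        uint64_t until = __rdtsc() + TAU_CYCLES;   // spin for ~tau
        while (__rdtsc() < until) { }
    }
    _xabort(0x02);   // our turn did not arrive within this quantum
    return false;
}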

9.3.1 Configuring Timeline Flags

M (the window size) represents the number of flags set by a transaction after it commits. The transaction must set enough flags to transfer the ordering information to the next transaction. This means that if the next transaction's execution takes more than M × τ after the commit of its predecessor, the ordering information will not be found until the next context switch (e.g., if M = 2 in Figure 9.3). Another important configuration parameter is the length of τ, as it represents an upper bound on the commit frequency. A large value of τ adds undesired delays, while small values of τ increase the number of timeline flags (metadata) and, consequently, the probability of cache capacity aborts.

9.3.2 Evaluation

In this section, we compare our two algorithms: BTH and TFH. Both algorithms try the HTM fast-path up to 5 times, then fall back to the slow-path using the Single Global Lock (SGL) [189] technique. In BTH, each thread uses 9 tickets distributed over different cache lines. The TFH algorithm uses a circular buffer of 32 flags with a window size (i.e., M) of 10 flags. Each flag covers a clock time period of 256 cycles. The rdtscp instruction was used to determine the elapsed time. To conduct our evaluation, we used two well-known benchmarks: Bank, a micro-benchmark that simulates monetary operations on a set of accounts (see Section 8.6.1), and TPC-C [48], an On-Line Transaction Processing (OLTP) benchmark from the Transaction Processing Performance Council, which simulates an ordering system across different warehouses. Experiments were conducted on a machine equipped with an Intel Haswell Core i7-4770 processor with 4 cores, hyper-threading enabled, and an 8 MB L3 cache. Results are the average of five runs.

Bank. Figure 9.4a shows the performance of the two algorithms as the number of threads increases. TFH outperforms BTH, with a 1.5× peak at 4 threads. Figure 9.4b shows that TFH incurs fewer aborts than BTH at a low number of threads (i.e., 1-4 threads). Beyond 4 threads, hyper-threading prevents HTM from providing further gains. As a result, the performance of both algorithms degrades due to aborts and falling back to the slow-path.

TPC-C is an on-line transaction processing (OLTP) benchmark. TPC-C transactions are more complex (i.e., more computation and memory accesses) and longer than the Bank micro-benchmark's.

Figure 9.4: BTH vs. TFH using the Bank and TPC-C Benchmarks. (a) Bank; (b) Bank Aborts; (c) TPC-C; (d) TPC-C Aborts.

TPC-C simulates a set of users executing transactions (entering and delivering orders) against a set of warehouses. Each warehouse (20 warehouses in our configuration) represents a supplier for a set of sales districts (50 districts in our configuration), and each district serves multiple concurrent customers' requests. Figures 9.4c and 9.4d report the performance and the aborts of the TFH and BTH algorithms. TFH outperforms BTH at a low number of threads; however, TFH is highly affected by contention at a higher number of threads.

Figure 9.5 illustrates the breakdown of abort reasons for each of the benchmarks/algorithms. From the figure we notice that the primary reason for aborts at a low number of threads is the ordering, while at higher thread counts, conflict aborts dominate. TFH incurs fewer capacity aborts than the BTH algorithm, thanks to the reduction in metadata size: a single shared timeline instead of per-thread tickets. Interestingly, the TFH algorithm has a low number of aborts at two threads because the two threads alternate their transaction executions and transfer the ordering information across the timeline flags. This is not the case for the BTH algorithm, because the ticket-burning technique depends on the transaction length and hence is subject to data conflicts even in the two-thread case.

Figure 9.5: BTH and TFH fast-path execution. (a) BTH, Bank fast-path to slow-path ratio; (b) BTH, Bank aborts breakdown; (c) BTH, TPC-C fast-path to slow-path ratio; (d) BTH, TPC-C aborts breakdown; (e) TFH, Bank fast-path to slow-path ratio; (f) TFH, Bank aborts breakdown; (g) TFH, TPC-C fast-path to slow-path ratio; (h) TFH, TPC-C aborts breakdown. Abort categories: Ordering, Conflict, Global Lock, Capacity, Others.

T1: transaction {              T2: transaction {
      x = 1;                         r1 = x;
      y = 1;                         r2 = y;
    }                                if (r1 == r2) STATEMENT
                                   }

Figure 9.6: Execution of two transactions using HyTM

9.4 The Ordered Write Back - Reduced Hardware Algorithm (OWB-RH)

As we discussed in Section 9.1.1, supporting ACO using HTM is challenging and inherently imposes restrictions on the executed code (e.g., capacity and transaction-length constraints). In the BTH and TFH algorithms we exploit time to exchange the ordering information between transactions; however, this approach is application- and workload-dependent and requires fine-tuning of the algorithm configurations. In this section, we propose a hybrid TM algorithm, named OWB-RH, which extends the OWB algorithm (see Chapter 6).

9.4.1 Hybrid TM Design Choices

Existing hybrid TM proposals exploit HTM in two ways. The first is executing the transaction in hardware and falling back to software (the slow-path) [36, 52] under certain conditions, such as capacity constraints or reaching a threshold on the number of retries. The second is employing HTM as a tool to enhance the software path. In the latter approach, called reduced hardware (RH) [119], the slow-path uses a smaller hardware transaction to complete its action. This in effect reduces the hardware fast-path instrumentation and speeds up the slow-path execution.

In the execution shown in Figure 9.6 [52], two concurrent transactions T1 and T2 run in a hybrid TM system. Algorithms 19 and 20 show inconsistent histories resulting from the coexistence of two different transaction types (one in software and the other in hardware) running concurrently. In Algorithm 19, the software transaction T1 writes back its write-set to memory. Meanwhile, transaction T2 (which runs as a hardware transaction) observes a partial commit of T1 (i.e., the value of x) and will execute invalid code. In Algorithm 20, during the execution of T2 (which runs as a software transaction), T1 completes as a pure hardware transaction. As a result, T2 observes an inconsistent state. Transactions with an inconsistent state may: i) perform actions that are outside the control of the TM framework (e.g., invalid memory accesses); ii) execute unexpected control flow, such as infinite loops; or iii) perform some irrevocable or visible operation. Most hybrid TM proposals take into account such interactions between software and hardware transactions. From a system perspective, hybrid TM approaches [152, 157] are classified into three categories.

Algorithm 19 A history with inconsistent HTM state. Initial values are 0

STM_WRITEBACK(T1)   x = 1
HTM_BEGIN(T2)       r1 = x
                    r2 = y
                    if (r1 == r2)   ▷ HTM Inconsistent State
STM_WRITEBACK(T1)   y = 1

Algorithm 20 A history with inconsistent STM state. Initial values are 0

STM_BEGIN(T2)   r1 = x
HTM_BEGIN(T1)   x = 1
                y = 1
HTM_END(T1)
                r2 = y
                if (r1 == r2)   ▷ STM Inconsistent State

The first approach restricts execution to a single type of transactions at a time (i.e., all software or all hardware) [109]. The drawback of this approach is that once one transaction needs to execute in software, the whole system is forced to run only in the software slow-path. The second approach permits software and hardware transactions to coexist [52], preserving correctness and atomicity through shared metadata. The last approach [157] combines the previous two and allows multiple types of transactions (e.g., pure software, pure hardware, and reduced hardware) to coexist with a phasing option; hence, it is possible to restrict execution to a certain type of transactions according to the workload.

9.4.2 Algorithm Description

As a first step in this direction, we propose a hybrid algorithm that exploits HTM to enhance the performance of our OWB algorithm (see Chapter 6) using the reduced hardware approach. The OWB-RH algorithm differs from OWB in the following points:

• The TryCommit procedure (see Algorithm 22) uses an inner hardware transaction to write back the changes and acquire the locks atomically. Instead of using a compare-and-swap operation (line 89 in Algorithm 3), HTM guarantees the required atomicity. However, to reduce the transaction length, the resolution of write-write conflicts (including any cascaded aborts) is handled outside the hardware transaction (lines 71-84, Algorithm 22).

• As the write back is done atomically in HTM, there is no need to re-validate the read-set using Validate Locked Reads.

• The versioned lock was simplified to store the age of the last writer instead of a monotonically increasing counter. This is mainly proposed to reduce the write-set of the hardware transaction, hence reducing aborts due to cache capacity. However, as a side effect, it permits reader transactions to stay valid if a concurrent writer transaction modified an address and got aborted afterwards (in OWB this causes a version increment). Furthermore, one of the lock bits (LOCK_FLAG) is reserved to indicate whether the lock is held. That way, it is possible to access the current (or the last) writer of any address by masking the lock bit. For that reason, the system stores references to the last MAX_ACTIVE transactions in a bounded circular array (line 2, Algorithm 21).

The Read, Write, Validate Reads and Commit procedures are similar to the ones in OWB, with the exception of using the age instead of the version inside the lock. In TryCommit, we need to back up the last writer version before overwriting it, as it is used during Abort. Finally, TryCommit falls back to the original OWB TryCommit procedure (line 91, Algorithm 22) when a retry threshold is reached, or when the transaction is aborted due to cache capacity.

9.4.3 Evaluation

To illustrate the gain of the proposed OWB-RH algorithm, we consider the RSTM micro-benchmarks [2] to evaluate the effect of changing the read-set/write-set sizes on performance (see Section 5.9.1 for the complete description of the tested applications). As a baseline, we compare against the original OWB algorithm (see Chapter 6). Experiments were conducted on a machine equipped with an Intel Haswell Core i7-4770 processor with hyper-threading enabled (i.e., 4 cores, 8 threads) and an 8 MB L3 cache. Results are the average of five runs.

As a general comment on the results, OWB-RH consistently outperforms (or performs the same as) OWB. In RNW1 (see Figures 9.9a and 9.9b), as the write-set is very small (only a single address), there is no difference between the two algorithms. However, in RWN and MCAS (see Figures 9.9c-9.9f), OWB-RH benefits from the fast write back using HTM. It is worth noting that the percentages of aborts reported in Figures 9.9b, 9.9d and 9.9f are the software aborts incurred by OWB-RH. OWB-RH tries to write back its write-set using HTM (the fast-path), and on consecutive aborts, it falls back to the original OWB write back (the slow-path). In our OWB-RH implementation, we set the maximum number of fast-path retries to five (see line 70, Algorithm 22). Figure 9.8 shows the details of slow-path execution: Figure 9.8a shows the ratio between transactions that managed to complete their execution using only the fast-path and those that fell back to the slow-path, while Figure 9.8b depicts the average number of fast-path retries per transaction.

Algorithm 21 OWB-RH - pseudo code

1: procedure Begin(Transaction tx)
2:   activeWriters[tx.age % MAX_ACTIVE] = tx  ▷ Keep a reference to the last active transactions
3:   tx.lastObserved = lastCommitted          ▷ Observe the last committed transaction
4: end procedure

5: procedure Read(SharedObject so, Transaction tx)
6:   readVersion = so.version
7:   if tx.writeSet.contains(so) then
8:     tx.readSet.add(so, readVersion)
9:     return tx.writeSet.get(so).value       ▷ Read written value
10:  else if readVersion & LOCK_FLAG then     ▷ Check speculative write
11:    currentWriter = activeWriters[readVersion & LOCK_MASK % MAX_ACTIVE]
12:    if currentWriter.age > tx.age then
13:      ABORT(currentWriter)                 ▷ W2 → R1; Read after Speculative Write
14:      go to 6
15:    else                                   ▷ W1 → R2; Add tx to its dependencies
16:      currentWriter.dependencies.add(tx)
17:      if currentWriter.status ≠ ACTIVE then  ▷ Double check writer
18:        ABORT(tx)                          ▷ Writer got aborted during registration
19:      end if
20:    end if
21:  end if
22:  if readVersion ≠ so.version then
23:    go to 6
24:  end if
25:  Validate_Reads(tx)
26:  tx.readSet.add(so, readVersion)
27:  return so.value
28: end procedure

29: procedure Write(SharedObject so, Object value, Transaction tx)
30:   tx.writeSet.add(so, newValue)           ▷ Save new value
31: end procedure

32: procedure Validate_Reads(Transaction tx)
33:   if tx.lastObserved ≠ lastCommitted then
34:     tx.lastObserved = lastCommitted
35:     for each Entry entry in tx.readSet do ▷ Validate read-set
36:       SharedObject so = entry.so
37:       if so.version ≠ entry.readVersion then
38:         return ABORT(tx)                  ▷ Read a wrong version
39:       end if
40:     end for
41:   end if
42:   return VALID
43: end procedure

44: procedure Abort(Transaction tx)
45:   if tx.status = ABORTED then return false; end if   ▷ Already aborted
46:   if tx.status = INACTIVE then return false; end if  ▷ Already completed
47:   while ! CAS(tx.status, ACTIVE, TRANSIENT) do       ▷ Try abort
48:     repeat until tx.status ≠ TRANSIENT               ▷ Spin wait
49:     go to 45
50:   end while
51:   for each Transaction dependency in tx.dependencies do
52:     ABORT(dependency)                     ▷ Abort dependent transactions
53:   end for
54:   for each Entry entry in tx.writeSet do
55:     SharedObject so = entry.so
56:     if so.version = (tx.age | LOCK_FLAG) then  ▷ Acquired lock
57:       so.value = entry.newValue           ▷ Revert value
58:       so.version = entry.version          ▷ Revert version
59:     end if
60:   end for
61:   tx.status = ABORTED
62:   return true
63: end procedure

Algorithm 22 OWB-RH - Commit

63: procedure TryCommit(Transaction tx)
64:   if tx.status = ABORTED then return false; end if   ▷ Already aborted
65:   while ! CAS(tx.status, ACTIVE, TRANSIENT) do       ▷ Try commit
66:     repeat until tx.status ≠ TRANSIENT               ▷ Spin wait
67:     return false
68:   end while
69:   if VALIDATE_READS(tx) ≠ VALID then return false; end if
70:   retries = MAX_RETRIES
71:   for each Entry entry in tx.writeSet do  ▷ Lock write-set
72:     SharedObject so = entry.so
73:     currentVersion = so.version
74:     if currentVersion & LOCK_FLAG then
75:       writerAge = currentVersion & LOCK_MASK
76:       if tx.age < writerAge then
77:         currentWriter = activeWriters[writerAge % MAX_ACTIVE]
78:         ABORT(currentWriter)              ▷ W2 → W1; Write after Speculative Write
79:       else
80:         ABORT(tx)                         ▷ W1 → W2; Write after Write
81:         return false
82:       end if
83:     end if
84:   end for
85:   status = XBEGIN                         ▷ ——— HTM Start
86:   if status = XABORT_EXPLICIT then
87:     go to 71                              ▷ Resolve write-write conflict
88:   else if status ≠ XBEGIN_STARTED then
89:     retries--                             ▷ Count HTM failure
90:     if retries = 0 ∨ status = XABORT_CAPACITY then
91:       return owb::TryCommit(tx)           ▷ Fall back to software slow-path
92:     end if
93:     go to 85                              ▷ Retry
94:   end if
95:   myLock = tx.age | LOCK_FLAG
96:   for each Entry entry in tx.writeSet do
97:     SharedObject so = entry.so
98:     if so.version & LOCK_FLAG ∧ so.version ≠ myLock then  ▷ Check write-write conflict
99:       XABORT                              ▷ Abort to resolve write-write conflict
100:    end if
101:    entry.version = so.version            ▷ Save old version
102:    so.version = myLock                   ▷ Lock the address
103:    temp = so.value                       ▷ Save old value
104:    so.value = entry.newValue             ▷ Expose written value
105:    entry.newValue = temp
106:  end for
107:  XEND                                    ▷ ——— HTM Commit
108:  tx.status = ACTIVE                      ▷ Transaction exposed
109:  return true
110: end procedure

111: procedure Commit(Transaction tx)
112:   if tx.status = ABORTED then return false; end if  ▷ Already aborted
113:   while ! CAS(tx.status, ACTIVE, TRANSIENT) do      ▷ Try complete
114:     repeat until tx.status ≠ TRANSIENT              ▷ Spin wait
115:     return false
116:   end while
117:   if VALIDATE_READS(tx) ≠ VALID then return false; end if
118:   for each Entry entry in tx.writeSet do
119:     SharedObject so = entry.so
120:     so.version = tx.age                  ▷ Unlock
121:   end for
122:   tx.status = INACTIVE                   ▷ Transaction committed
123:   return true
124: end procedure

Figure 9.7: Aborts Breakdown. (a) OWB Algorithm; (b) OWB-RH Algorithm. Abort categories: Read After Write, Cascade, Validation Fails, Write After Write, Locked Write; benchmarks: RNW1, RWN, MCAS.

Figure 9.8: OWB-RH fast-path execution. (a) Fast-path to slow-path ratio; (b) Average number of fast-path retries per transaction; benchmarks: RNW1, RWN, MCAS.

In the RNW1 benchmark, with rare conflicts, transactions always succeed in using the fast-path. However, in RWN and MCAS, transactions incur repeated conflicts and fall back to the slow-path. Interestingly, after 4 threads, the percentage of transactions that fail to complete using the fast-path increases notably; this is due to the hyper-threading effect. Figures 9.9b, 9.9d and 9.9f show a reduction in the number of aborts in OWB-RH; however, this reduction comes from reducing the validation and locked-write aborts (see Figure 9.7), both of which occur mostly during commit. For the locked-write aborts, as the fast-path transaction retries multiple times, some of these aborts are resolved; in other words, the fast-path retries have a useful back-off effect on the commit execution. On the other hand, the slight reduction in validation aborts is due to the use of the age instead of the monotonic counter.

Figure 9.9: OWB vs. OWB-RH using RSTM Micro-benchmarks. (a) RNW1; (b) RNW1 Aborts; (c) RWN; (d) RWN Aborts; (e) MCAS; (f) MCAS Aborts.

Chapter 10

Conclusions and Future Work

The vast majority of applications and algorithms are not designed to properly exploit the current trend of multi-processor chip design, which creates a gap between the available commodity hardware and the running software. This gap is expected to persist for years, given the burdens of developing and maintaining parallel programs. In this dissertation, we aim at extracting coarse-grained parallelism from sequential code. We exploited transactional memory (TM) as an optimistic concurrency technique for supporting safe memory access, and introduced algorithmic modifications for preserving the program's chronological order. TM is known for its execution overhead; while comparable to locking overhead, in comparison to sequential code it can outweigh any performance gain. In this thesis we tackled this issue in several ways: employing static analysis, fast-path sequential execution, transactional pooling, transactional increments, transaction prioritization, exploiting low-level semantics, and making use of HTM support. Summarizing, in this thesis we presented the following solutions.

We presented HydraVM, a JVM that automatically refactors concurrency in Java programs at the bytecode level. Our basic idea is to reconstruct the code in a way that exhibits data-level and execution-flow parallelism. STM was exploited to provide memory guards that preserve consistency and program order. Our experiments show that HydraVM achieves speedup between 2×-5× on a set of benchmark applications.

We presented Lerna, a completely automated system that combines a software tool and a runtime library to extract parallelism from sequential applications without any programmer intervention. Lerna leverages software transactions to solve conflicts due to data sharing in the produced parallel code, thus preserving the original application semantics. With Lerna, it is finally possible to ignore the application logic and exploit cheap hardware parallelism through blind parallelization. Lerna showed strong results on multiple benchmarks, with an average 2.7× speedup over the original code.

Both frameworks propose novel techniques over past STM-based parallelization works in that

they benefit from static analysis to reduce transactional overheads, permit accessing some reducible variables allocated on the stack (not just global variables, as in [120]), target arbitrary programs (not just recursive ones, as [32] does), are entirely software-based (unlike [32, 66, 56, 182]), and do not require the program source code.

With the OWB, OWB-RH, OUL, and OUL-steal algorithms, we effectively address the problem of committing transactions in an order defined prior to execution. We show that even if a system requires a specific commit order, it is possible to achieve high performance by exploiting parallelism in the presence of data conflicts. Results show important trends: our OUL outperforms the other ordered competitors consistently. In particular, the maximum speedup achieved is 4× over Ordered TL2, 4.3× over Ordered NOrec, 8× over Ordered UndoLog visible, 10× over Ordered UndoLog invisible, and 5.7× over STMLite [120]. Interestingly, the peak gain over the sequential non-instrumented execution is 10× in micro-benchmarks and 16.5× in STAMP.

With the proposed new TM extensions, we show that it is possible to exploit application semantics in a transparent manner, without involving the programmer. We did so by identifying TM-friendly semantics and proposing an approach to inject them into current TM algorithms and frameworks. We also integrated our work in GCC and provide full compiler support for it. Our experimental results show a promising improvement over the base algorithms. A possible extension of this line of research is to investigate including HTM algorithms and supporting more complex semantic patterns.

10.1 Discussion & Limitations of Parallelization using TM

TM provides mechanisms to handle data dependencies at run-time, but it comes with an overhead when used for parallelizing sequential code. The primary sources of overhead are: synchronizing shared accesses, preserving the chronological order at commit, read-set validation, and aborting transactions because of actual data dependencies, control-flow dependencies, or false conflicts. Some code patterns are TM-parallelization friendly, such as: code with local computations and few shared accesses; long iterations with shared accesses at their end; workloads with balanced computation across iterations; and code patterns that access flat data structures (i.e., those with no single point of contention). On the other hand, the performance of TM-parallelized code is highly dependent on the ability to exclude addresses that do not need to be accessed transactionally (e.g., read-only addresses). This could be done using alias analysis or by high-level user intervention. Parallelization using TM becomes less effective when:

• there are loops with few iterations, because the actual application parallelization degree is limited;

• there is an irreducible global access at the beginning of each loop iteration, thus increasing the chance of invalidating most transactions from the very beginning;

• the workload is heavily unbalanced across iterations, with frequent control-flow changes;

• there is extensive usage of multiple levels of pointers, which hinders the use of alias analysis and forces transactional accesses even for address calculations; and

• the size of the shared read accesses increases significantly and, consequently, the validation overhead increases.

Due to the above limitations, one of the main factors that affects the efficiency of parallelization using TM is the ability to move patterns that are not suited to TM parallelization outside the parallel TM blocks. We did that by exploiting static analysis techniques. However, static analysis may not be sufficient to detect complex patterns. In Section 10.1.1, we show some recommended programming practices to limit or avoid these complex (irreducible) patterns. Finally, two features distinguish TM from other parallelization techniques: its ability to make use of the application (workload) semantics to increase concurrency, even in the presence of true conflicts; and the hardware support in mainstream commodity processors, such as Intel's Haswell.

10.1.1 Recommended Programming Patterns

In our work, we assume that there is no user intervention. However, we identify here some programming patterns that would help our work perform better. The key points in these guidelines are: 1) helping the alias analysis detect relations between the accessed locations, which reduces the number of transactional accesses and enables exploiting our semantic APIs; and 2) enhancing the transactional execution by minimizing the cost of retries or reducing the transaction length.

Avoid Unneeded Global Variables

The excessive use of global variables is generally undesirable because it leads to an unpredictable global state, poor testability, and reduced code comprehension. In the context of parallelization, in order to guarantee memory consistency, each access to a global variable needs to be monitored. This forces the alias analysis to do more work when checking such an access, and it may end with a may-alias conclusion, which forces the use of a transactional access. Recall that transactional accesses are costly, as they enlarge the transaction metadata and slow down the validation process.

Use Simple Loops

The foreach style of loop is commonly used instead of standard loops (e.g., for with a counter, or the conditional while). Foreach simplifies programming by declaring some logic to be applied to each element in a collection instead of writing this logic n times. However, in our work foreach loops are highly undesirable because they maintain an internal state (e.g., an iterator) which builds an inter-dependency between the iterations. This internal state cannot be analyzed or reduced simply, and thereby the loop may not be parallelizable. A sketch of the two forms follows.
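The following C++ sketch contrasts the two forms; sum_squares is an illustrative example, not taken from the benchmarks.

#include <vector>
#include <cstddef>

// Foreach form: the hidden iterator advances at each step, creating a
// loop-carried dependency the parallelizer cannot break:
//     for (double v : data) total += v * v;
//
// Counted form: the index is private per iteration, and the only cross-
// iteration state is the reduction on `total`, which can be recognized
// as reducible.
double sum_squares(const std::vector<double>& data) {
    double total = 0.0;
    for (size_t i = 0; i < data.size(); ++i)
        total += data[i] * data[i];
    return total;
}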

Relocate Local Processing

As TM employs a “rollback-and-retry” mechanism for aborted transactions, each time a transaction is aborted it needs to re-execute all the computation within the transaction boundaries. These computations may be on local or shared variables. If the computations are local, then we highly recommend moving them before the first shared access. That way, we can push the transaction start until after these computations, and hence reduce the transaction length and the retry cost. If this is not possible, then it is beneficial to move the computations to the end of the transaction, after all global reads. Thus, if the transaction conflicts with another one, it detects the conflict at an earlier stage of its life, making the retry less costly.
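The hoisting is sketched below in C++; TM_READ, TM_WRITE, and expensive_local_filter are illustrative stand-ins.

#define TM_READ(x)     (x)   // stand-ins for the instrumented TM accesses
#define TM_WRITE(x, v) ((x) = (v))

double shared_sum = 0.0;
static double expensive_local_filter(double x) { return x * x; }  // local work

// Before: the local work sits inside the transaction and is replayed on
// every abort.
void iteration_before(double input) {
    /* transaction start */
    double w = expensive_local_filter(input);       // local computation
    TM_WRITE(shared_sum, TM_READ(shared_sum) + w);  // shared access
    /* transaction end */
}

// After: local work hoisted; the transaction covers only the shared access,
// so a retry replays almost nothing.
void iteration_after(double input) {
    double w = expensive_local_filter(input);       // outside the transaction
    /* transaction start */
    TM_WRITE(shared_sum, TM_READ(shared_sum) + w);
    /* transaction end */
}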

Avoid Pointer Arithmetic

Performing pointer arithmetic is usually unsafe because it may lead to invalid memory accesses. Additionally, with pointer arithmetic the automatic parallelizer has no choice but to access the target transactionally. This conservative action may lead to considerable runtime overhead. We encourage substituting pointer arithmetic, where possible, with simple base/offset accesses.

Use Pure Functions

“Pure” functions are those whose results depend only on the input parameters and that have no side effects (i.e., they do not affect the global program state). Organizing the code to exploit pure functions, whenever possible, allows us to reason about the semantics of these functions. In most cases, pure functions can be identified as “safe” functions, so we do not need to include them in the transactional accesses or in the alias analysis. This has a good impact on reducing both the compilation time and the runtime overhead.

No Function Pointers, Use Templates!

Pointers to functions are used to generalize access to program logic or to provide callback functions. We identified wide usage of function pointers in STAMP [37] and PARSEC [133] (e.g., comparator functions). However, in many cases this generality is not essential. For example, a program that uses the Kmeans algorithm with data points stored as floats will not decide dynamically at runtime to use Kmeans with integer data points. If the Kmeans code extracts the data-point comparison logic into a function pointer to provide generality, it places a big limitation on parallelization. Even with TM, it is not possible to handle code with a function pointer, because such code may be irrevocable. An alternative approach is the use of code templates. Code templates (or generics in Java) permit the programmer to write generic code that is instantiated at compile time. That way, the code does not lose its abstraction and generality, and the parallelization framework is able to optimize it. The contrast is sketched below.
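A minimal C++ illustration; reduce_fp and reduce_tpl are hypothetical names.

#include <cstddef>

// Function-pointer version: the callee is opaque at compile time, so the
// parallelizer must conservatively instrument (or reject) the call.
double reduce_fp(const double* pts, size_t n, double (*cmp)(double, double)) {
    double best = pts[0];
    for (size_t i = 1; i < n; ++i) best = cmp(best, pts[i]);
    return best;
}

// Template version: the comparator is known at compile time, can be inlined
// and analyzed (e.g., marked safe), while the code keeps its generality.
template <typename Cmp>
double reduce_tpl(const double* pts, size_t n, Cmp cmp) {
    double best = pts[0];
    for (size_t i = 1; i < n; ++i) best = cmp(best, pts[i]);
    return best;
}

// Usage:
//   double m = reduce_tpl(data, n,
//                         [](double a, double b) { return a < b ? a : b; });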

Data Structure Accesses

PARSEC [133] is an example of a benchmark suite that uses data structures extensively. As we cannot inline each call to the data structure operations (e.g., add, remove, or contains), these operations are instrumented and executed transactionally. As data structure operations are called frequently, we faced considerable overhead with these applications. Thus, our recommendation is to decouple the parallelization of the application from its underlying data structures, and to replace these data structures with concurrent, thread-safe versions. That way, data structure operations can be marked as safe calls, and the parallelization effort can focus on the application logic.

Immutable Objects

Objects with immutable state (e.g., an immutable singleton) are thread-safe because they do not permit changes to their state. Refactoring the code to use immutable objects, and identifying such objects, helps the parallelization framework significantly. Examples of immutable objects are: a string utility, a service registry, or a mathematics engine.

Irrevocable Operations Handling

One of the TM limitations is the inability to handle irrevocable operations (e.g., I/O). A loop with many computations and a single irrevocable operation (e.g., reading input from a file, or printing to the screen) is not a candidate for parallelization using TM. One possible solution is to split the loop into two parts and buffer the irrevocable action (e.g., store the read value or the value to print) in memory. In this case, one of the loops will be irrevocable (i.e., the one that buffers the I/O), while the other loop, with the computations, will be parallelizable. The split is sketched below.
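A C++ sketch of the split, with compute standing in for the per-iteration work:

#include <cstdio>
#include <vector>
#include <cstddef>

static double compute(double x) { return x * 1.5; }   // placeholder work

void process(const std::vector<double>& in) {
    std::vector<double> out(in.size());

    // Loop 1: pure computation, a candidate for TM parallelization.
    for (size_t i = 0; i < in.size(); ++i)
        out[i] = compute(in[i]);

    // Loop 2: the buffered irrevocable I/O, executed sequentially.
    for (size_t i = 0; i < out.size(); ++i)
        std::printf("%f\n", out[i]);
}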

10.2 Future Work

10.2.1 Transaction Checkpointing

Aborting a transaction is a costly operation, not only because of the rollback cost: retrying the code execution doubles the cost and wastes precious processing cycles. Additionally, transactions may do a lot of local processing work that does not use any protected shared variables. A possible technique is transaction checkpointing, which creates multiple checkpoints at selected execution points. Using checkpoints, a transaction saves the state of the work done after performing a considerable amount of it. Later, if the transaction experiences a conflict, due to an inconsistent read or because it wrote a value that should be read by another transaction executing code that is earlier in the chronological order, it can return to the last state at which the execution was valid (i.e., before the invalid memory operation). Checkpointing introduces an overhead for creating the checkpoints (e.g., saving the current state) and for restoring the state (upon abort). Unless the work done by the transaction is large enough to outweigh this overhead, this technique is not recommended; also, checkpointing should be employed only when the abort rate exceeds a predefined threshold. A good example that would benefit from this technique is a transaction that finishes its processing by updating a shared data structure with a single point of insertion (e.g., a linked list, queue, or stack); two concurrent transactions will conflict when trying to use this shared data structure. With checkpointing, the aborted transaction can jump back to just before the last valid step and retry the shared-access step.

Transactions are defined as a unit of work: all or none. This assumption mandates re-executing the whole transaction body upon conflict, even if the transaction executed correctly for a subset of its lifetime. In the context of parallelization, restarting transactions can prevent any possible speedup, especially when preserving order and executing balanced transactions with equal processing time (e.g., similar loop iterations). With transaction checkpointing, a conflicting transaction can select the best checkpoint at which the execution was still valid and retry execution from it. Inserting checkpoints can be done using different techniques:

• Static Checkpointing. Insert checkpoints at equal portions of the transaction. This is done statically in the code.

• Dependency Checkpointing. Alias analysis and memory dependency analysis provide a best-effort guess for memory address aliasing. A common outcome is the may-alias result, which indicates a possible conflict. Placing a checkpoint before may-alias accesses could improve performance: a transaction continues execution if there is no alias, and in exact-alias situations it retries execution from right before the invalid access.

Transaction Transaction Undo Log Write Buff #0 Read Set #0

Default Checkpoint Default Checkpoint (Transaction Start) (Transaction Start)

Checkpoint 1 Checkpoint 1

Write Buff #1 Read Set #1

Write Buff #2 Read Set #2 Checkpoint 2 Checkpoint 2

(a) Write-buffer (b) Undo Log

Figure 10.1: Checkpointing implementation with write-buffer and undo-logs

accesses could improve the performance; a transaction will continue execution if no alias, and will retry execution right before the invalid access at exact alias situations.

• Periodic Checkpointing. During runtime, we checkpoint the work done at certain times (e.g., after doing a considerable amount of work).

• Last Conflicting Address. In our model, transactions executing symmetric code (i.e., iteration). Conflicting transactions usually continue conflicting at successive retrials. A conflicting address represents a guess for possible transactions collisions, and could be used as a hint for placing a checkpoint (i.e., before accessing this address).

The performance gain from any of these proposed techniques is application- and workload-dependent. Checkpointing is especially useful when a transaction finishes its processing by updating a shared data structure (e.g., Fluidanimate, see Section 5.9.3); the aborted transaction can jump back to the checkpoint before the shared access and retry execution from there.

In order to support checkpointing, a transaction must be partially abortable. As mentioned before, TM algorithms use either eager or lazy versioning, through the use of an undo log or a write-buffer, respectively. For eager TM, the undo log needs to store metadata about each checkpoint (i.e., where in the log it occurs). Whenever the transaction wants to partially roll back, the undo log is used to undo changes up to the last recorded checkpoint (see Figure 10.1b). With lazy TM implementations, the write-buffer must be split according to the checkpoints: each checkpoint is associated with a separate write-buffer that stores its changes (see Figure 10.1a). Upon a conflict, the transaction discards the write-buffers of all checkpoints that come after the conflicting location. The drawbacks of this approach are that read operations need to check multiple write-buffers, and that the write-buffers must not overlap. An alternative approach is to treat each checkpoint as a nested transaction and employ closed-nesting techniques for handling the partial rollback; however, supporting closed nesting introduces considerable overhead to TM performance and complicates the system design.
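To make the write-buffer variant concrete, the following is a minimal sketch assuming a lazy-versioning STM; all names (WriteBuffer, TxContext, partial_abort) are ours and, in a real system, would be hooked into the TM library's read/write barriers and validation logic:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Per-checkpoint write-buffer: address -> speculative value.
struct WriteBuffer {
    std::unordered_map<std::uintptr_t, std::uint64_t> writes;
};

struct TxContext {
    // One write-buffer per open checkpoint; index 0 belongs to the
    // default checkpoint taken at transaction start (cf. Figure 10.1a).
    std::vector<WriteBuffer> buffers;

    TxContext() : buffers(1) {}

    // Open a new, empty write-buffer and return its checkpoint id.
    std::size_t checkpoint() {
        buffers.emplace_back();
        return buffers.size() - 1;
    }

    // Record a speculative write in the current (newest) buffer.
    void write(std::uintptr_t addr, std::uint64_t val) {
        buffers.back().writes[addr] = val;
    }

    // A read must consult every open write-buffer, newest first --
    // this is the extra lookup cost noted above.
    bool read(std::uintptr_t addr, std::uint64_t &val) const {
        for (auto it = buffers.rbegin(); it != buffers.rend(); ++it) {
            auto w = it->writes.find(addr);
            if (w != it->writes.end()) { val = w->second; return true; }
        }
        return false;  // not written by us: fall through to shared memory
    }

    // Partial abort: discard every buffer opened at or after checkpoint
    // `cp`, then reopen an empty one; execution resumes from the code
    // location where checkpoint `cp` was taken.
    void partial_abort(std::size_t cp) {
        buffers.resize(cp);
        buffers.emplace_back();
    }
};
```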

10.2.2 Complex Semantic Expressions

An interesting feature of the semantic operations listed in Table 8.1 is that they can be composed, by having more than one operator and/or more than one variable in the conditional expression. For example, the scenario shown in Algorithm 9 can be further enhanced if we consider the whole conditional expression (i.e., TM READ(x) > 0 || TM READ(y) > 0) as one semantic read operation. In this example, if the condition was initially true and a concurrent transaction then modifies only one variable, either x or y, to be negative, considering the clause as a whole avoids aborting T1, given the OR operator. A similar enhancement consists of allowing complex expressions in conditional statements (e.g., x + y > 0), where modifications of multiple variables may compensate for each other so that the value of the overall expression remains unchanged.

Compositional and complex expressions can be generalized, in the same way as presented in Section 8.2, as an abstract method cmp(address1, address2, ..., val1, val2, ...), where the number of arguments depends on the expression. This way, S-NOrec and S-TL2 can handle them like any other cmp operation, as long as they guarantee that all the variables included in an expression are read consistently.

On the other hand, integrating such semantics into the compiler passes is more complicated, because it requires first identifying the code pattern that matches the conditional expression. Unfortunately, unlike the "trivial" expressions mentioned in Table 8.1, those complex expressions are transformed by the compiler into different basic blocks. Thus, handling these conditions as a single semantic read operation during compilation requires an inter-block analysis. From a compiler design perspective, this optimization would add a non-negligible overhead to the compilation process. This difficulty raises a compromise on the practicality of supporting them at the compiler level, especially if the trivial expressions already cover the common use cases at the application level. At the current stage, we handle compound conditions as multiple semantic reads. A deeper investigation of those expressions is needed in order to find the best way to resolve the tradeoff between saving additional aborts and complicating the compiler passes.
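As an illustration of how such a compound clause could be tracked, the sketch below registers the whole expression and revalidates its truth value rather than the individual reads; the names (SemanticCmp, revalidate) are ours and are not part of the released S-NOrec/S-TL2 implementations:

```cpp
#include <functional>
#include <vector>

// A compound condition, e.g. TM_READ(x) > 0 || TM_READ(y) > 0, recorded
// as a single semantic read: the addresses it covers, the predicate over
// their values, and the truth value observed when first evaluated.
struct SemanticCmp {
    std::vector<const long*> addrs;
    std::function<bool(const std::vector<long>&)> expr;
    bool observed;
};

// Revalidation succeeds as long as the clause as a whole still evaluates
// to the observed value, even if individual variables changed (e.g., x
// turned negative while y stayed positive under ||). The snapshot must be
// taken consistently, e.g. under the algorithm's value-based validation.
bool revalidate(const SemanticCmp &c) {
    std::vector<long> vals;
    vals.reserve(c.addrs.size());
    for (const long *a : c.addrs) vals.push_back(*a);
    return c.expr(vals) == c.observed;
}
```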

10.2.3 Semantic-Based HTM

The challenges of injecting semantics into STM and HTM algorithms are very different. Concurrency control in STM is performed entirely in software. That allows any sort of modification to the transaction execution, including our extension of defining new semantic constructs and calling them instead of the classical ones (i.e., TM READ and TM WRITE). On the other hand, the current HTM releases [149, 35] leverage hardware for detecting conflicting executions and leave very limited optimization opportunities to the TM framework. For instance, there is no control over the granularity of speculation in an HTM transaction; in other words, every memory access within the boundaries of an HTM transaction is monitored by the hardware itself (exploiting an enhanced cache coherency protocol). As a result, there is no straightforward way to replace the basic TM READ/TM WRITE constructs with semantic calls inside HTM transactions (as can be done for STM). Although injecting semantics into HTM algorithms is harder than into STM, we believe that there is still room for it. For example, the following two approaches can be adopted separately:

• Injecting semantics in the software fallback path of the HTM transaction, similar to how we injected them in pure STM algorithms. For example, RH-NOrec [119] is an HTM algorithm whose software fallback paths are similar to NOrec and can be enhanced in the same way as S-NOrec.

• Exploiting our compilation passes to make further enhancements. For example, if the conflicting reads/writes are shifted by the compiler to the end of the transaction, the probability of raising a conflict at runtime decreases. Compilation-time solutions do not require modifying the execution pattern of HTM transactions at runtime (which is impossible in current HTM models), providing a good direction to overcome the limited flexibility of HTM APIs.

Regarding the first approach, we investigated the alternatives for the software fallback path published in the literature so far. Interestingly, we found that most of them are compliant with our former proposals on STM. For example, NOrec has been considered in the literature [52, 152, 119] as one of the best STM candidates to integrate with HTM transactions, because it has only one global lock as shared metadata, which minimizes the overhead of monitoring the metadata inside HTM transactions (recall that any metadata speculated on inside an HTM transaction occupies cache lines and may generate false conflicts). Among those proposals, RH-NOrec [119] is known to be one of the best-performing HTM algorithms. The main idea of RH-NOrec is to execute transactions in one of three modes: fast-path, in which a transaction is entirely executed in HTM; slow-path, in which the body of a transaction is executed in software similar to NOrec and the commit phase is executed using a small HTM transaction; and slow-slow-path, which is NOrec itself. These three modes are made consistent by speculating on the global lock inside the HTM transactions. The semantics support can be injected into the slow-path and the slow-slow-path of RH-NOrec in a similar way to what we did for NOrec.

The second approach for injecting semantics into HTM algorithms is to involve the compiler. The order of executing reads and writes inside transactions is one of the main reasons for raising or avoiding conflicts at runtime; shifting conflicting reads/writes to the end of the transaction minimizes the probability of a runtime conflict. A further optimization on the compilation-level semantic operations defined in Table 8.1 is to reorder them at compilation time, when possible. As a preliminary example, increment/decrement operations can be delayed to the end of the transaction if the incremented/decremented variables are not read later in the transaction. Also, conditional branches can be "optimistically" calculated before a transaction starts, resulting in a deterministic branch selection inside that transaction; the optimistic branch selection is then revalidated at the end of the transaction to preserve serializability [20]. The proposed compilation-time optimizations are valid for both STM and HTM, although HTM benefits more from them because of its limited APIs.

An appealing research direction is to investigate the different correctness guarantees that can be provided by altering operations at compilation time, such as opacity [76], serializability [20], and publication/privatization safety [121, 117]. Regarding the former examples, optimistic branch selection clearly breaks opacity, similar to the lazy subscription issue discussed in [57]. Also, shifting increments/decrements may break publication safety in some cases, as discussed in [156]. Investigating the theoretical and practical possibilities/impossibilities in this direction is one of the objectives of our future work.
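A hedged sketch of the increment-sinking example follows; the TM_BEGIN/TM_END delimiters and the expensive helper are local stand-ins we define here so the example is self-contained, while TM_READ/TM_WRITE mirror the constructs used throughout this dissertation:

```cpp
#include <vector>

// Minimal stand-ins; a real build would use the TM library's macros.
#define TM_BEGIN()      /* start hardware/software transaction */
#define TM_END()        /* commit */
#define TM_READ(x)      (x)
#define TM_WRITE(x, v)  ((x) = (v))

static long expensive(long v) { return v * v; }  // long computation window

long counter = 0;  // hot shared counter

// Before: `counter` enters the write-set at the top of the transaction,
// so its cache line stays monitored for the whole body.
void tx_before(std::vector<long> &data, std::vector<long> &result, int i) {
    TM_BEGIN();
    TM_WRITE(counter, TM_READ(counter) + 1);
    long v = TM_READ(data[i]);
    TM_WRITE(result[i], expensive(v));
    TM_END();
}

// After: since `counter` is not read later in the transaction, the
// compiler may sink the increment to just before commit, shrinking the
// window in which a concurrent access to `counter` aborts the HTM
// transaction.
void tx_after(std::vector<long> &data, std::vector<long> &result, int i) {
    TM_BEGIN();
    long v = TM_READ(data[i]);
    TM_WRITE(result[i], expensive(v));
    TM_WRITE(counter, TM_READ(counter) + 1);  // sunk to the end
    TM_END();
}
```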

10.2.4 Studying the Impact of Data Access Patterns

Transactional Memory does not impose assumptions on data access patterns, since it tolerates data conflicts. In our work, we tried to infer the data access patterns from the conflict rates, and we adapted the execution accordingly (Lerna's adaptive runtime and HydraVM's adaptive online architecture). For example, in matrix multiplication, if the calculation of each element is assigned to a separate transaction executing in a different thread, then every N consecutive (e.g., two) transactions will conflict on updating adjacent memory locations. The same behavior appears in the membership of Kmeans data points calculated by concurrent transactions.

In some situations, it is possible for the user to provide hints about the data access pattern of the application. In this case, the parallelization framework does not need to "learn" from the execution (e.g., via profiling or adaptive execution); instead, it can exploit these hints for a better assignment of data to transactions (threads). Furthermore, knowing the nature (e.g., size, processing time, or reference locality) of the processed data in advance allows the parallelization framework to select the best configuration for the underlying architecture (e.g., cache lines, thread stack size).
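As a sketch of what such a hint could look like, purely hypothetical since neither Lerna nor HydraVM exposes such an interface today, a declared stride would let the framework co-locate conflict-prone adjacent iterations in the same transaction instead of discovering the conflicts at runtime:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical hint types (none of these names exist in Lerna/HydraVM):
// the user declares that every `stride` adjacent iterations touch
// adjacent data and would conflict if run as separate transactions.
enum class Pattern { Strided, Random, ReadMostly };

struct AccessHint {
    Pattern pattern;
    std::size_t stride;
};

// Group adjacent, conflict-prone iterations into one chunk, so each
// chunk runs as a single transaction and the known conflicts between
// neighboring iterations never materialize.
std::vector<std::pair<std::size_t, std::size_t>>
partition(std::size_t iterations, const AccessHint &hint) {
    std::size_t step = (hint.pattern == Pattern::Strided) ? hint.stride : 1;
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    for (std::size_t i = 0; i < iterations; i += step)
        chunks.push_back({i, std::min(i + step, iterations)});
    return chunks;
}
```

Bibliography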

[1] Intel Parallel Studio. https://software.intel.com/en-us/intel-parallel-studio-xe.

[2] RSTM: The University of Rochester STM. http://www.cs.rochester.edu/research/synchronization/rstm/.

[3] The LLVM Compiler Infrastructure. http://llvm.org.

[4] TinySTM: A time-based STM. http://tinystm.org/tinystm.

[5] Intel transactional memory compiler and runtime application binary interface. https://software.intel.com/sites/default/files/m/5/a/2/a/f/8097-Intel_TM_ABI_1_0_1.pdf, 2008.

[6] A.-R. Adl-Tabatabai, B. T. Lewis, V. Menon, B. R. Murphy, B. Saha, and T. Shpeisman. Compiler and runtime support for efficient software transactional memory. In Proceedings of the 2006 Conference on Programming Language Design and Implementation, pages 26–37, Jun 2006.

[7] A.-R. Adl-Tabatabai, T. Shpeisman, and J. Gottschlich. Draft specification of transactional language constructs for C++, 2009.

[8] Y. Afek, H. Avni, and N. Shavit. Towards consistency oblivious programming. In OPODIS, pages 65–79, 2011.

[9] M. Agarwal, K. Malik, K. M. Woley, S. S. Stone, and M. I. Frank. Exploiting postdominance for speculative parallelization. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, pages 295–305. IEEE, 2007.

[10] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie. Unbounded transactional memory. In HPCA ’05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 316–327, Washington, DC, USA, 2005. IEEE Computer Society.


[11] T. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. Parallel and Distributed Systems, IEEE Transactions on, 1(1):6–16, Jan. 1990.

[12] M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney. Adaptive optimization in the Jalapeño JVM. In Proceedings of the 15th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '00, pages 47–65, New York, NY, USA, 2000. ACM.

[13] H. Attiya, A. Gotsman, S. Hans, and N. Rinetzky. Safety of live Transactions in Transactional Memory: TMS is necessary and sufficient. In DISC, pages 376–390, 2014.

[14] H. Avni and B. C. Kuszmaul. Improving HTM scaling with consistency-oblivious programming. In 9th Workshop on Transactional Computing, TRANSACT ’14, 2014. Available: http://transact2014.cse.lehigh.edu/.

[15] D. A. Bader and K. Madduri. Design and implementation of the hpcs graph analysis benchmark on symmetric multiprocessors. In High Performance Computing–HiPC 2005, pages 465–476. Springer, 2005.

[16] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In ACM SIGPLAN Notices, volume 35, pages 1–12. ACM, 2000.

[17] U. Banerjee. Dependence analysis, volume 3. Springer Science & Business Media, 1997.

[18] J. Barreto, A. Dragojevic, P. Ferreira, R. Filipe, and R. Guerraoui. Unifying thread-level speculation and transactional memory. In Proceedings of the 13th International Middleware Conference, pages 187–207. Springer-Verlag New York, Inc., 2012.

[19] L. Baugh, N. Neelakantam, and C. Zilles. Using hardware memory protection to build a high-performance, strongly atomic hybrid transactional memory. In Proceedings of the 35th International Symposium on Computer Architecture, 2008.

[20] P. A. Bernstein and N. Goodman. Serializability theory for replicated databases. J. Comput. Syst. Sci., 31(3):355–374, 1985.

[21] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency control and recovery in database systems. Addison-Wesley, 1987.

[22] A. Bhowmik and M. Franklin. A general compiler framework for speculative multithreading. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '02, pages 99–108, New York, NY, USA, 2002. ACM.

[23] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pages 72–81, New York, NY, USA, 2008. ACM.

[24] G. Bilardi and K. Pingali. Algorithms for computing the static single assignment form. J. ACM, 50(3):375–425, 2003.

[25] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, July 1970.

[26] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, and T. Lawrence. Parallel programming with Polaris. Computer, 29(12):78–82, 1996.

[27] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.

[28] C. Blundell, J. Devietti, E. C. Lewis, and M. M. K. Martin. Making the fast case common and the uncommon case simple in unbounded transactional memory. SIGARCH Comput. Archit. News, 35(2):24–34, 2007.

[29] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, pages 101–113, New York, NY, USA, 2008. ACM.

[30] P. Boulet, A. Darte, G.-A. Silber, and F. Vivien. Loop parallelization algorithms: From parallelism extraction to code generation. Parallel Comput., 24:421–444, May 1998.

[31] B. Bradel and T. Abdelrahman. Automatic trace-based parallelization of Java programs. In Parallel Processing, 2007. ICPP 2007. International Conference on, page 26, Sept. 2007.

[32] B. J. Bradel and T. S. Abdelrahman. The use of hardware transactional memory for the trace-based parallelization of recursive Java programs. In Proceedings of the 7th International Conference on Principles and Practice of Programming in Java, PPPJ '09, pages 101–110, New York, NY, USA, 2009. ACM.

[33] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. Cao Minh, and B. Hertzberg. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In PPOPP, pages 187–197, 2006.

[34] B. Cahoon and K. S. McKinley. Data flow analysis for software prefetching linked data structures in Java. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, PACT '01, pages 280–291, Washington, DC, USA, 2001. IEEE Computer Society.

[35] H. W. Cain, M. M. Michael, B. Frey, C. May, D. Williams, and H. Le. Robust architectural support for transactional memory in the power architecture. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 225–236, New York, NY, USA, 2013. ACM.

[36] I. Calciu, T. Shpeisman, G. Pokam, and M. Herlihy. Improved single global lock fallback for best-effort hardware transactional memory. In 9th Workshop on Transactional Computing, TRANSACT '14, 2014. Available: http://transact2014.cse.lehigh.edu/.

[37] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In IISWC '08: Proceedings of The IEEE International Symposium on Workload Characterization, September 2008.

[38] C. Cao Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis, and K. Olukotun. An effective hybrid transactional memory system with strong isolation guarantees. In Proceedings of the 34th Annual International Symposium on Computer Architecture, Jun 2007.

[39] B. Carlstrom, J. Chung, H. Chafi, A. McDonald, C. Minh, L. Hammond, C. Kozyrakis, and K. Olukotun. Executing Java programs with transactional memory. Science of Computer Programming, 63(2):111–129, 2006.

[40] B. D. Carlstrom, A. McDonald, H. Chafi, J. Chung, C. C. Minh, C. Kozyrakis, and K. Olukotun. The Atomos transactional programming language. ACM SIGPLAN Notices, 41(6):1–13, 2006.

[41] B. Chan. The UMT benchmark code. Lawrence Livermore National Laboratory, Livermore, CA, 2002.

[42] B. Chan and T. Abdelrahman. Run-time support for the automatic parallelization of Java programs. The Journal of Supercomputing, 28(1):91–117, 2004.

[43] M. Chen and K. Olukotun. TEST: A tracer for extracting speculative threads. In Code Generation and Optimization, 2003. CGO 2003. International Symposium on, pages 301–312. IEEE, 2003.

[44] M. K. Chen and K. Olukotun. The Jrpm system for dynamically parallelizing Java programs. In Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on, pages 434–445. IEEE, 2003.

[45] P. Chen, M. Hung, Y. Hwang, R. Ju, and J. Lee. Compiler support for speculative multithreading architecture with probabilistic points-to analysis. In ACM SIGPLAN Notices, volume 38, pages 25–36. ACM, 2003.

[46] J. Choi, M. Gupta, M. Serrano, V. Sreedhar, and S. Midkiff. Escape analysis for Java. ACM SIGPLAN Notices, 34(10):1–19, 1999.

[47] R. A. Chowdhury, P. Djeu, B. Cahoon, J. H. Burrill, and K. S. McKinley. The limits of alias analysis for scalar optimizations. In Compiler Construction, pages 24–38. Springer, 2004.

[48] Transaction Processing Performance Council. TPC-C benchmark, revision 5.11, 2010.

[49] R. Cytron. Doacross: Beyond vectorization for multiprocessors. In Proc. of 1986 Int’l Conf. on Parallel Processing, 1986.

[50] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. An efficient method of computing static single assignment form. In Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 25–35. ACM, 1989.

[51] L. Dagum and R. Menon. OpenMP: An industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55, 1998.

[52] L. Dalessandro, F. Carouge, S. White, Y. Lev, M. Moir, M. L. Scott, and M. F. Spear. Hybrid NOrec: A case study in the effectiveness of best effort hardware transactional memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 39–52, New York, NY, USA, 2011. ACM.

[53] L. Dalessandro, M. F. Spear, and M. L. Scott. NOrec: Streamlining STM by abolishing ownership records. In ACM SIGPLAN Notices, volume 45, pages 67–78. ACM, 2010.

[54] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid transactional memory. In ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 336–346, New York, NY, USA, 2006. ACM.

[55] A. Deutsch. Interprocedural may-alias analysis for pointers: Beyond k-limiting. In ACM SIGPLAN Notices, volume 29, pages 230–241. ACM, 1994.

[56] M. DeVuyst, D. M. Tullsen, and S. W. Kim. Runtime parallelization of legacy code on a transactional memory system. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, pages 127–136. ACM, 2011.

[57] D. Dice, T. L. Harris, A. Kogan, Y. Lev, and M. Moir. Pitfalls of lazy subscription. In WTTM, 2014.

[58] D. Dice, M. Herlihy, D. Lea, Y. Lev, V. Luchangco, W. Mesard, M. Moir, K. Moore, and D. Nussbaum. Applications of the adaptive transactional memory test platform. In Transact 2008 workshop, 2008.

[59] D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In Proceedings of the 20th International Conference on Distributed Computing, DISC’06, pages 194–208, Berlin, Heidelberg, 2006. Springer-Verlag.

[60] D. Dice and N. Shavit. What really makes transactions faster? In Proc. of the 1st TRANSACT 2006 Workshop, 2006.

[61] N. Diegues and P. Romano. Self-tuning Intel transactional synchronization extensions. In 11th International Conference on Autonomic Computing, ICAC '14. USENIX Association, 2014.

[62] S. Doherty, L. Groves, V. Luchangco, and M. Moir. Towards formally specifying and verifying Transactional Memory. Formal Aspects of Computing, 25(5):769–799, 2013.

[63] A. Dragojević, R. Guerraoui, and M. Kapalka. Stretching transactional memory. In ACM SIGPLAN Notices, volume 44, pages 155–165. ACM, 2009.

[64] A. Dragojevic and T. Harris. STM in the small: trading generality for performance in software transactional memory. In European Conference on Computer Systems, Proceedings of the Seventh EuroSys Conference 2012, EuroSys ’12, Bern, Switzerland, April 10-13, 2012, pages 1–14, 2012.

[65] Z. Du, C. Lim, X. Li, C. Yang, Q. Zhao, and T. Ngai. A cost-driven compilation framework for speculative parallelization of sequential programs. ACM SIGPLAN Notices, 39(6):71–81, 2004.

[66] T. J. Edler von Koch and B. Franke. Limits of region-based dynamic binary parallelization. In ACM SIGPLAN Notices, volume 48, pages 13–22. ACM, 2013.

[67] R. Ennals. Software transactional memory should not be obstruction-free. Technical Report IRC-TR-06-052, Intel Research Cambridge Tech Report, Jan 2006.

[68] K. A. Faigin, S. A. Weatherford, J. P. Hoeflinger, D. A. Padua, and P. M. Petersen. The polaris internal representation. International Journal of Parallel Programming, 22(5):553–586, 1994.

[69] P. Felber, C. Fetzer, and T. Riegel. Dynamic performance tuning of word-based software transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 237–246, New York, NY, USA, 2008. ACM.

[70] P. Felber, V. Gramoli, and R. Guerraoui. Elastic transactions. In Distributed Computing, 23rd International Symposium, DISC 2009, Elche, Spain, September 23-25, 2009. Proceedings, pages 93–107, 2009.

[71] S. Garcia, D. Jeon, C. M. Louie, and M. B. Taylor. Kremlin: rethinking and rebooting gprof for the multicore age. In ACM SIGPLAN Notices, volume 46, pages 458–469. ACM, 2011.

[72] G. Golan-Gueta, G. Ramalingam, M. Sagiv, and E. Yahav. Automatic scalable atomicity via semantic locking. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, San Francisco, CA, USA, February 7-11, 2015, pages 31–41, 2015.

[73] M. Gonzalez-Mesa, E. Gutierrez, E. L. Zapata, and O. Plata. Effective transactional memory execution management for improved concurrency. ACM Transactions on Architecture and Code Optimization (TACO), 11(3):24, 2014.

[74] T. Grosser, A. Groesslinger, and C. Lengauer. Polly: Performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(04):1250010, 2012.

[75] R. Guerraoui and M. Kapalka. Opacity: A Correctness Condition for Transactional Memory. Technical report, EPFL, 2007.

[76] R. Guerraoui and M. Kapalka. On the correctness of transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 175–184, New York, NY, USA, 2008. ACM.

[77] R. Guerraoui and M. Kapalka. Principles of Transactional Memory. Synthesis Lectures on Distributed Computing Theory. Morgan & Claypool, 2011.

[78] M. Gupta, S. Mukhopadhyay, and N. Sinha. Automatic parallelization of recursive procedures. International Journal of Parallel Programming, 28(6):537–562, 2000.

[79] M. Hall, J. Anderson, S. Amarasinghe, B. Murphy, S. Liao, E. Bugnion, and M. Lam. Maximizing multiprocessor performance with the SUIF compiler. Computer, 29(12):84–89, 1996.

[80] L. Hammond, M. Willey, and K. Olukotun. Data speculation support for a chip multiprocessor, volume 32. ACM, 1998.

[81] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional memory coherence and consistency. In Proc. of ISCA, page 102, 2004.

[82] T. Harris and K. Fraser. Language support for lightweight transactions. ACM SIGPLAN Notices, (38), 2003.

[83] T. Harris, J. Larus, and R. Rajwar. Transactional Memory, 2nd edition. Synthesis Lectures on Computer Architecture, 5(1):1–263, 2010.

[84] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy. Composable memory transactions. In PPoPP '05: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 48–60, New York, NY, USA, 2005. ACM.

[85] A. Hassan, R. Palmieri, and B. Ravindran. Optimistic transactional boosting. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’14, Orlando, FL, USA, February 15-19, 2014, pages 387–388, 2014.

[86] D. Heath, R. Jarrow, and A. Morton. Bond pricing and the term structure of interest rates: A new methodology for contingent claims valuation. Econometrica: Journal of the Econometric Society, pages 77–105, 1992.

[87] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In SPAA, pages 355–364, 2010.

[88] J. L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006.

[89] M. Herlihy. The art of multiprocessor programming. In PODC ’06: Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, pages 1–2, New York, NY, USA, 2006. ACM.

[90] M. Herlihy and E. Koskinen. Transactional boosting: A methodology for highly-concurrent transactional objects. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 207–216, New York, NY, USA, 2008. ACM.

[91] M. Herlihy, V. Luchangco, and M. Moir. A flexible framework for implementing software transactional memory. Volume 41, pages 253–262, New York, NY, USA, October 2006. ACM.

[92] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software transactional memory for dynamic-sized data structures. In Proceedings of the twenty-second annual symposium on Principles of distributed computing, pages 92–101. ACM, 2003.

[93] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA '93, pages 289–300, New York, NY, USA, 1993. ACM.

[94] S. S. Huang, A. Hormati, D. F. Bacon, and R. M. Rabbah. Liquid Metal: Object-oriented programming across the hardware/software boundary. In J. Vitek, editor, ECOOP 2008 - Object-Oriented Programming, 22nd European Conference, Paphos, Cyprus, July 7-11, 2008, Proceedings, volume 5142 of Lecture Notes in Computer Science, pages 76–103. Springer, 2008.

[95] G. C. Hunt, M. M. Michael, S. Parthasarathy, and M. L. Scott. An efficient algorithm for concurrent priority queue heaps. Inf. Process. Lett., 60:151–157, November 1996.

[96] W. M. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery. The superblock: An effective technique for VLIW and superscalar compilation. The Journal of Supercomputing, 7:229–248, 1993. 10.1007/BF01205185.

[97] Intel Corporation. Intel architecture instruction set extensions programming reference, Feb. 2012.

[98] T. Johnson. Characterizing the performance of algorithms for lock-free objects. Computers, IEEE Transactions on, 44(10):1194–1207, Oct. 1995.

[99] D. Kanter. Analysis of Haswell's transactional memory. Real World Technologies (February 15, 2012), 2012.

[100] N. Karjanto, B. Yermukanova, and L. Zhexembay. Black-Scholes equation. arXiv preprint arXiv:1504.03074, 2015.

[101] G. Korland, N. Shavit, and P. Felber. Noninvasive concurrency with Java STM. In Third Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG), 2010.

[102] V. Krishnan and J. Torrellas. A chip-multiprocessor architecture with speculative multithreading. Computers, IEEE Transactions on, 48(9):866–880, 1999.

[103] V. P. Krothapalli and P. Sadayappan. An approach to synchronization for parallel computing. In Proceedings of the 2nd international conference on Supercomputing, pages 573–581. ACM, 1988.

[104] S. Kumar, M. Chu, C. J. Hughes, P. Kundu, and A. Nguyen. Hybrid transactional memory. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’06, pages 209–220, New York, NY, USA, 2006. ACM.

[105] M. Lam and M. Rinard. Coarse-grain parallel programming in Jade. In ACM SIGPLAN Notices, volume 26, pages 94–105. ACM, 1991.

[106] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169, 1998.

[107] J. R. Larus and R. Rajwar. Transactional Memory. Morgan and Claypool, 2006.

[108] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.

[109] Y. Lev, M. Moir, and D. Nussbaum. PhTM: Phased transactional memory. In TRANS- ACT, 2007.

[110] Z. Li, A. Jannesari, and F. Wolf. Discovery of potential parallelism in sequential programs. In Parallel Processing (ICPP), 2013 42nd International Conference on, pages 1004–1013. IEEE, 2013.

[111] W. Liu, J. Tuck, L. Ceze, W. Ahn, K. Strauss, J. Renau, and J. Torrellas. POSH: A TLS compiler that exploits program structure. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 158–167. ACM, 2006.

[112] V. Luchangco, M. Wong, H. Boehm, J. Gottschlich, J. Maurer, P. McKenney, M. Michael, M. Moir, T. Riegel, M. Scott, et al. Transactional memory support for C++, 2014.

[113] M. G. Main. Detecting leftmost maximal periodicities. Discrete Appl. Math., 25:145–153, September 1989.

[114] J. Mankin, D. Kaeli, and J. Ardini. Software transactional memory for multicore embedded systems. SIGPLAN Not., 44(7):90–98, 2009.

[115] P. J. Marandi, M. Primi, and F. Pedone. High performance state-machine replication. In DSN, pages 454–465, 2011.

[116] V. Marathe, M. Spear, C. Heriot, A. Acharya, D. Eisenstat, W. Scherer III, and M. Scott. Lowering the overhead of nonblocking software transactional memory. In Workshop on Languages, Compilers, and Hardware Support for Transactional Computing (TRANSACT), 2006.

[117] V. J. Marathe, M. F. Spear, and M. L. Scott. Scalable techniques for transparent privatization in software transactional memory. In 2008 International Conference on Parallel Processing, ICPP 2008, September 8-12, 2008, Portland, Oregon, USA, pages 67–74, 2008.

[118] B. L. Massingill, T. G. Mattson, and B. A. Sanders. Reengineering for parallelism: An entry point into PLPP (Pattern Language for Parallel Programming) for legacy applications. In Proceedings of the Twelfth Pattern Languages of Programs Workshop (PLoP 2005), 2005.

[119] A. Matveev and N. Shavit. Reduced hardware NOrec: A safe and scalable hybrid transactional memory. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 59–71. ACM, 2015.

[120] M. Mehrara, J. Hao, P.-C. Hsu, and S. Mahlke. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '09, pages 166–176, New York, NY, USA, 2009. ACM.

[121] V. Menon, S. Balensiefer, T. Shpeisman, A. Adl-Tabatabai, R. L. Hudson, B. Saha, and A. Welc. Practical weak-atomicity semantics for Java STM. In SPAA 2008: Proceedings of the 20th Annual ACM Symposium on Parallelism in Algorithms and Architectures, Munich, Germany, June 14-16, 2008, pages 314–325, 2008.

[122] J. Merrill. GENERIC and GIMPLE: A new tree representation for entire functions. In Proceedings of the 2003 GCC Developers Summit, pages 171–179, 2003.

[123] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, PODC ’96, pages 267–275, New York, NY, USA, 1996. ACM.

[124] C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In IEEE International Symposium on Workload Characterization, IISWC., pages 35–46, 2008.

[125] M. Mohamedin, B. Ravindran, and R. Palmieri. ByteSTM: Virtual machine-level Java software transactional memory. In Coordination Models and Languages, pages 166–180. Springer, 2013.

[126] M. A. M. Mohamedin. On optimizing transactional memory: Transaction splitting, scheduling, fine-grained fallback, and NUMA optimization. 2015.

[127] M. Moir. Practical implementations of non-blocking synchronization primitives. In In Proc. of 16th PODC, pages 219–228, 1997.

[128] K. E. Moore. Thread-level transactional memory. In Wisconsin Industrial Affiliates Meeting, Oct. 2004.

[129] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood. LogTM: Log-based transactional memory. In Proc. 12th Annual International Symposium on High Performance Computer Architecture, 2006.

[130] J. E. B. Moss. Open nested transactions: Semantics and support. In Workshop on Memory Performance Issues, 2005.

[131] J. E. B. Moss, N. D. Griffeth, and M. H. Graham. Abstraction in recovery management. In ACM SIGMOD Record, volume 15, pages 72–83. ACM, 1986.

[132] J. E. B. Moss and A. L. Hosking. Nested transactional memory: Model and architecture sketches. Sci. Comput. Program., 63:186–201, December 2006.

[133] M. Müller, D. Charypar, and M. Gross. Particle-based fluid simulation for interactive applications. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '03, pages 154–159, Aire-la-Ville, Switzerland, 2003. Eurographics Association.

[134] S. C. Müller, G. Alonso, A. Amara, and A. Csillaghy. Pydron: Semi-automatic parallelization for multi-core and the cloud. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 645–659, Broomfield, CO, Oct. 2014. USENIX Association.

[135] MySQL AB. MySQL: The world's most popular open source database, 1995.

[136] Y. Ni, V. Menon, A. Adl-Tabatabai, A. L. Hosking, R. L. Hudson, J. E. B. Moss, B. Saha, and T. Shpeisman. Open nesting in software transactional memory. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2007, San Jose, California, USA, March 14-17, 2007, pages 68–78, 2007.

[137] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose, C. Zissulescu, and E. Deprettere. Daedalus: Toward composable multimedia MP-SoC design. In Proceedings of the 45th Annual Design Automation Conference, pages 574–579. ACM, 2008.

[138] NVIDIA. Compute unified device architecture programming guide, 2007.

[139] C. H. Papadimitriou. The serializability of concurrent database updates. J. ACM, 26:631–653, October 1979.

[140] C. D. Polychronopoulos. Compiler optimizations for enhancing parallelism and their impact on architecture design. Computers, IEEE Transactions on, 37(8):991–1004, 1988.

[141] M. K. Prabhu and K. Olukotun. Using thread-level speculation to simplify manual parallelization. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '03, pages 1–12, New York, NY, USA, 2003. ACM.

[142] C. Quiñones, C. Madriles, J. Sánchez, P. Marcuello, A. González, and D. Tullsen. Mitosis compiler: An infrastructure for speculative threading based on pre-computation slices. In ACM SIGPLAN Notices, volume 40, pages 269–279. ACM, 2005.

[143] H. E. Ramadan, I. Roy, M. Herlihy, and E. Witchel. Committing conflicting transactions in an STM. In ACM SIGPLAN Notices, volume 44, pages 163–172. ACM, 2009.

[144] A. Raman, H. Kim, T. R. Mason, T. B. Jablin, and D. I. August. Speculative parallelization using software multi-threaded transactions. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS '10, pages 65–76, New York, NY, USA, 2010. ACM.

[145] R. Ramaseshan and F. Mueller. Toward thread-level speculation for coarse-grained parallelism of regular access patterns. In Workshop on Programmability Issues for Multi-Core Computers, page 12, 2008.

[146] L. Rauchwerger and D. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. SIGPLAN Not., 30:218–232, June 1995.

[147] K. Ravichandran, A. Gavrilovska, and S. Pande. DeSTM: Harnessing determinism in STMs for application development. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 213–224, 2014.

[148] Y. Raz. The principle of commitment ordering, or guaranteeing serializability in a heterogeneous environment of multiple autonomous resource managers using atomic commitment. In VLDB, volume 92, pages 292–312, 1992.

[149] J. Reinders. Transactional synchronization in Haswell. http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/, 2013.

[150] T. Riegel, P. Felber, and C. Fetzer. A lazy snapshot algorithm with eager validation. In S. Dolev, editor, Distributed Computing, Lecture Notes in Computer Science, pages 284–298. Springer Berlin / Heidelberg, 2006.

[151] T. Riegel, C. Fetzer, H. Sturzrehm, and P. Felber. From causal to z-linearizable transactional memory. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, PODC '07, pages 340–341, New York, NY, USA, 2007. ACM.

[152] T. Riegel, P. Marlier, M. Nowack, P. Felber, and C. Fetzer. Optimizing hybrid transactional memory: The importance of nonspeculative operations. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '11, pages 53–64, New York, NY, USA, 2011. ACM.

[153] C. J. Rossbach, Y. Yu, J. Currey, J. Martin, and D. Fetterly. Dandelion: A compiler and runtime for heterogeneous systems. In M. Kaminsky and M. Dahlin, editors, ACM SIGOPS 24th Symposium on Operating Systems Principles, SOSP '13, Farmington, PA, USA, November 3-6, 2013, pages 49–68. ACM, 2013.

[154] E. Rotenberg and J. Smith. Control independence in trace processors. In Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture, pages 4–15. IEEE Computer Society, 1999.

[155] W. Ruan, Y. Liu, and M. Spear. STAMP need not be considered harmful. In Ninth ACM SIGPLAN Workshop on Transactional Computing, 2014.

[156] W. Ruan, Y. Liu, and M. F. Spear. Transactional read-modify-write without aborts. TACO, 11(4):63:1–63:24, 2014.

[157] W. Ruan and M. Spear. An opaque hybrid transactional memory. 2015.

[158] R. Rugina and M. Rinard. Automatic parallelization of divide and conquer algorithms. In ACM SIGPLAN Notices, volume 34, pages 72–83. ACM, 1999.

[159] G. Rünger and M. Schwind. Parallelization strategies for mixed regular-irregular applications on multicore systems. In Advanced Parallel Processing Technologies, pages 375–388. Springer, 2009.

[160] M. M. Saad, M. Mohamedin, and B. Ravindran. HydraVM: Extracting parallelism from legacy sequential code using STM. In 4th USENIX Workshop on Hot Topics in Parallelism, HotPar'12, Berkeley, CA, USA, June 7-8, 2012, 2012.

[161] M. M. Saad, R. Palmieri, A. Hassan, and B. Ravindran. Extending TM primitives using low level semantics. In SPAA '16: The 28th ACM Symposium on Parallelism in Algorithms and Architectures, New York, NY, USA, 2016. ACM.

[162] M. M. Saad, R. Palmieri, A. Hassan, and B. Ravindran. On extending TM primitives using low level semantics. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, 2016.

[163] M. M. Saad, R. Palmieri, and B. Ravindran. Lerna: Transparent and effective speculative loop parallelization. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, 2016.

[164] M. M. Saad, R. Palmieri, and B. Ravindran. On ordering transaction commit. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, Barcelona, Spain, March 12-16, 2016, pages 46:1– 46:2, 2016.

[165] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. Cao Minh, and B. Hertzberg. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In PPoPP '06, pages 187–197, Mar 2006.

[166] B. Saha, A.-R. Adl-Tabatabai, and Q. Jacobson. Architectural support for software transactional memory. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 185–196, Washington, DC, USA, 2006. IEEE Computer Society.

[167] J. H. Saltz, R. Mirchandaney, and K. Crowley. The preprocessed doacross loop. In ICPP (2), pages 174–179, 1991.

[168] W. N. Scherer III and M. L. Scott. Contention management in dynamic software transactional memory. In PODC '04: Proceedings of Workshop on Concurrency and Synchronization in Java Programs, NL, Canada, 2004. ACM.

[169] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv., 22(4):299–319, 1990.

[170] N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, PODC '95, pages 204–213, New York, NY, USA, 1995. ACM.

[171] M. Snir. MPI–the Complete Reference: The MPI Core, volume 1. MIT Press, 1998.

[172] G. S. Sohi, S. E. Breach, and T. Vijaykumar. Multiscalar processors. In ACM SIGARCH Computer Architecture News, volume 23, pages 414–425. ACM, 1995.

[173] M. Spear, K. Kelsey, T. Bai, L. Dalessandro, M. Scott, C. Ding, and P. Wu. Fastpath speculative parallelization. Languages and Compilers for Parallel Computing, pages 338–352, 2010.

[174] M. F. Spear, M. M. Michael, and C. von Praun. RingSTM: Scalable transactions with a single atomic instruction. In Proceedings of the ACM Annual Symposium on Parallelism in Algorithms and Architectures, SPAA '08, pages 275–284, New York, NY, USA, 2008. ACM.

[175] B. Steensgaard. Points-to analysis in almost linear time. In Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 32–41. ACM, 1996.

[176] J. Steffan and T. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. In High-Performance Computer Architecture, 1998. Proceedings., 1998 Fourth International Symposium on, pages 2–13. IEEE, 1998.

[177] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A scalable approach to thread-level speculation, volume 28. ACM, 2000.

[178] K. Streit, C. Hammacher, A. Zeller, and S. Hack. Sambamba: Runtime adaptive parallel execution. In Proceedings of the 3rd International Workshop on Adaptive Self-Tuning Computing Systems, page 7. ACM, 2013.

[179] W. Thies, V. Chandrasekhar, and S. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in C programs. In Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pages 356–369. IEEE, 2007.

[180] Transactional Memory Specification Drafting Group. Draft specification of transactional language constructs for C++, version 1.1, February 2012. Available: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3725.pdf.

[181] J. Tsai and P. Yew. The superthreaded architecture: Thread pipelining with run-time data dependence checking and control speculation. In Parallel Architectures and Compilation Techniques, 1996. Proceedings of the 1996 Conference on, pages 35–46. IEEE, 1996.

[182] N. A. Vachharajani. Intelligent speculation for pipelined multithreading. PhD thesis, Princeton, NJ, USA, 2008. AAI3338698.

[183] H. Vandierendonck, S. Rul, and K. De Bosschere. The Paralax infrastructure: Automatic parallelization with a helping hand. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 389–400. ACM, 2010.

[184] C. von Praun, R. Bordawekar, and C. Caşcaval. Modeling optimistic concurrency using quantitative dependence analysis. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 185–196. ACM, 2008.

[185] C. von Praun, L. Ceze, and C. Caşcaval. Implicit parallelism with ordered transactions. In PPoPP, pages 79–89, 2007.

[186] P. Wu, A. Kejariwal, and C. Caşcaval. Compiler-driven dependence profiling to guide program parallelization. Languages and Compilers for Parallel Computing, pages 232–248, 2008.

[187] W. A. Wulf and S. A. McKee. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995.

[188] L. Xiang and M. L. Scott. Software partitioning of hardware transactions. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, San Francisco, CA, USA, February 7-11, 2015, pages 76–86, 2015.

[189] R. M. Yoo, C. J. Hughes, K. Lai, and R. Rajwar. Performance evaluation of Intel transactional synchronization extensions for high-performance computing. In High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for, pages 1–11. IEEE, 2013.

[190] L. Zhang, V. K. Grover, M. M. Magruder, D. Detlefs, J. J. Duffy, and G. Graefe. Software transaction commit order and conflict management, May 4 2010. US Patent 7,711,678.

[191] C. Zilles and G. Sohi. Master/slave speculative parallelization. In Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture, MICRO 35, pages 85–96, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.

[192] H. P. Zima, H.-J. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6(1):1–18, 1988.