Extracting Parallelism from Legacy Sequential Code Using Software Transactional Memory

Mohamed M. Saad

Preliminary Examination Proposal submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Engineering

Binoy Ravindran, Chair; Anil Kumar S. Vullikanti; Paul E. Plassmann; Robert P. Broadwater; Roberto Palmieri; Sedki Mohamed Riad

May 5, 2015 Blacksburg, Virginia

Keywords: Transactional Memory, Software Transactional Memory (STM), Automatic Parallelization, Low-Level Virtual Machine, Optimistic Concurrency, Speculative Execution, Legacy Systems

Copyright 2015, Mohamed M. Saad

Mohamed M. Saad

(ABSTRACT)

Increasing the number of processors has become the mainstream approach in modern chip design. On the other hand, most applications are designed or written for single-core processors, so they do not benefit from the underlying computation resources. Moreover, there exists a large base of legacy software which would require immense effort and cost to rewrite and re-engineer.

Transactional memory (TM) has emerged as a powerful concurrency control abstraction. TM simplifies parallel programming to the level of coarse-grained locking while achieving fine-grained locking performance. In this dissertation, we exploit TM as an optimistic execution approach for transforming a sequential application into a parallel one. We design and implement two frameworks that support automatic parallelization: Lerna and HydraVM.

HydraVM is a virtual machine that automatically extracts parallelism from legacy sequential code (at the bytecode level) through a set of techniques including code profiling, data dependency analysis, and execution analysis. HydraVM is built by extending the Jikes RVM and modifying its baseline compiler. Correctness of the program is preserved by exploiting software transactional memory (STM) to manage concurrent and out-of-order memory accesses.

Lerna is a compiler framework that automatically and transparently detects and extracts parallelism from sequential code through a set of techniques including code profiling, instrumentation, and adaptive execution. Lerna is cross-platform and independent of the programming language. The parallel execution exploits memory transactions to manage concurrent and out-of-order memory accesses. This scheme makes Lerna very effective for sequential applications with data sharing. Additionally, we introduce the general conditions for embedding any transactional memory algorithm into our system.

While prior research shows that transactions must commit in chronological order to preserve program semantics, enforcing this ordering constrains scalability at large core counts. Our first major post-preliminary research goal is to eliminate the need for ordering transactions without affecting program consistency. We propose building a cooperation mechanism in which transactions can safely forward some of their changes. This approach eliminates false conflicts and increases concurrency opportunities. As our second contribution, we propose transactional checkpointing as a technique for reducing the processing time wasted on aborts and code retrial. With checkpointing, a transaction can be partially aborted and re-execute only the invalid portion.

Contents

1 Introduction
  1.1 Motivation
  1.2 Current Research Contributions
    1.2.1 Transactional Parallelization
    1.2.2 HydraVM
    1.2.3 Lerna
  1.3 Proposed Post-Prelim Work
  1.4 Outline

2 Background
  2.1 Manual Parallelization
  2.2 Automatic Parallelization
  2.3 Thread Level Speculation
  2.4 Transactional Memory
    2.4.1 NOrec
    2.4.2 TinySTM
  2.5 Parallelism Limits and Costs

3 Past & Related Work
  3.1 Transactional Memory
  3.2 Parallelization
  3.3 Optimistic Concurrency

    3.3.1 Thread-Level Speculation
    3.3.2 Parallelization using Transactional Memory
  3.4 Comparison with existing work

4 HydraVM
  4.1 Program Reconstruction
  4.2 Transactional Execution
  4.3 Jikes RVM
  4.4 System Architecture
    4.4.1 Bytecode Profiling
    4.4.2 Trace detection
    4.4.3 Parallel Traces
    4.4.4 Reconstruction Tuning
    4.4.5 Misprofiling
  4.5 Implementation
    4.5.1 Detecting Real Memory Dependencies
    4.5.2 Handling Irrevocable Code
    4.5.3 Method Inlining
    4.5.4 ByteSTM
    4.5.5 Parallelizing Nested Loops
  4.6 Experimental Evaluation
  4.7 Discussion

5 Lerna
  5.1 Challenges
  5.2 Low-Level Virtual Machine
  5.3 General Architecture and Workflow
  5.4 Code Profiling
  5.5 Program Reconstruction

    5.5.1 Dictionary Pass
    5.5.2 Builder Pass
    5.5.3 Transactifier Pass
  5.6 Transactional Execution
    5.6.1 Ordered NOrec
    5.6.2 Ordered TinySTM
    5.6.3 Irrevocable Transactions
    5.6.4 Transactional Increment
  5.7 Adaptive Runtime
    5.7.1 Batch Size
    5.7.2 Jobs Tiling and Partitioning
    5.7.3 Workers Selection
    5.7.4 Manual Tuning
  5.8 Evaluation
    5.8.1 Micro-benchmarks
    5.8.2 The STAMP Benchmark
    5.8.3 The PARSEC Benchmark
  5.9 Discussion

6 Conclusions & Post-Preliminary Work
  6.1 Proposed Post-Prelim Work
    6.1.1 Analysing Ordered Commit
    6.1.2 Transaction Checkpointing

List of Figures

2.1 Loop classification according to inter-iteration dependency
2.2 Running traces over three processors
2.3 OpenMP code snippet
2.4 Example of thread-level speculation over four processors
2.5 An example of transactional code using atomic TM constructs
2.6 Maximum speedup according to Amdahl's and Gustafson's laws

4.1 Program Reconstruction as Jobs
4.2 Parallel execution pitfalls: (a) Control Flow Graph, (b) Possible parallel execution scenario, and (c) Managed TM execution
4.3 Transaction States
4.4 HydraVM Architecture
4.5 Matrix Multiplication
4.6 Program Reconstruction as a Producer-Consumer Pattern
4.7 3x3 Matrix Multiplication Execution using Traces Generated by profiling 2x2 Matrix Multiplication
4.8 Static Single Assignment form Example
4.9 Nested Traces
4.10 HydraVM Speedup

5.1 Lerna’s Loop Transformation: from Sequential to Parallel
5.2 LLVM Three Layers Design
5.3 Lerna’s Architecture and Workflow

5.4 The LLVM Intermediate Representation using SSA form of Figure 5.1a
5.5 Natural, Simple and Transformed Loop
5.6 Symmetric vs Normal Transactions
5.7 Conditional Counters
5.8 Workers Manager
5.9 ReadNWrite1 Benchmark
5.10 ReadWriteN Benchmark
5.11 MCAS Benchmark
5.12 Adaptive workers selection
5.13 Kmeans and Genome Benchmarks
5.14 Vacation and SSCA2 Benchmarks
5.15 Labyrinth and Intruder Benchmarks
5.16 Effect of Tiling on abort and speedup using 8 workers and Genome
5.17 Kmeans performance with user intervention
5.18 PARSEC Benchmarks

6.1 The Execution of Ordered Transactions over 4 and 8 threads
6.2 Total delay time (δ) of executing 1000 transactions with an increasing number of threads. The unit of the y axis is the number of τ
6.3 Transactional States
6.4 Checkpointing implementation with write-buffer and undo-logs

List of Tables

4.1 Testbed and platform parameters for HydraVM experiments
4.2 Profiler Analysis on Benchmarks

5.1 Transactional Memory Design Choices
5.2 Testbed and platform parameters for Lerna experiments
5.3 Input configurations for STAMP benchmarks
5.4 Input configurations for PARSEC benchmarks

Chapter 1

Introduction

In the last decade, parallelism has gained a lot of attention due to the physical constraints that prevent increasing processor operating frequency. The runtime of a program is measured by the time required to execute its instructions. Decreasing the runtime requires reducing the execution time of a single instruction, which implies increasing the operating frequency. From the mid-1980s until the mid-2000s¹, this approach, namely frequency scaling, was the dominant force improving the performance of programs on commodity processors. However, operating frequency is proportional to power consumption and, consequently, heat generation. This obstacle put an end to the era of frequency scaling and forced chip designers to find an alternative approach for maintaining the growth of application performance. Gordon E. Moore made the empirical observation that the number of transistors in a dense integrated circuit doubles approximately every two years. Given the power consumption issues, these additional transistors were used to add extra hardware, and multicore processors became the norm in chip design. Wulf et al. [132] report that the rate of improvement in processor speed exceeds the rate of improvement in memory speed, which constrains the performance of any program running on a single processing unit. These combined factors (i.e., power consumption, the memory wall, and the availability of extra hardware resources) made parallelism, which had long been employed primarily in high-performance computing, appear as an appealing alternative.

Parallelism is the execution of a sequential application simultaneously on multiple computation resources. The application is divided into multiple sub-tasks that can run in parallel. The communication between sub-tasks defines the parallelism granularity. An application is embarrassingly parallel when the communications between its sub-tasks are rare, while an application with many sub-task communications exhibits fine-grained parallelism.

¹In 2004, Intel announced the cancellation of its Tejas and Jayhawk processors, the planned successors to the Pentium 4 and Xeon families respectively, because of the increase in processor power consumption.

The maximum possible speedup of a single program as a result of parallelization is bounded by Amdahl's law, which defines the relation between the speedup and the time needed for the sequential fraction of the program. On the other hand, the vast majority of applications and algorithms are designed or written for single-core processors (often intentionally designed to be sequential to reduce development costs, while exploiting Moore's law on single-core chips).

1.1 Motivation

Many organizations with enterprise-class legacy software are increasingly faced with a hardware technology refresh challenge due to the ubiquity of chip multiprocessor (CMP) hardware. This problem is apparent when legacy codebases run into several million LOC and are not concurrent. Manual exposition of concurrency is largely non-scalable for such codebases due to the significant difficulty in exposing concurrency and ensuring the correctness of the converted code. In some instances, sources are not available due to proprietary reasons, intellectual property issues (of integrated third-party software), and organizational boundaries. Additionally, adapting these programs to exploit hardware parallelism requires a (possibly) massive amount of software rewriting by skilled programmers with knowledge and experience of parallel programming. They must take care of both high-level aspects, such as data sharing, race conditions, and the detection of parallel sections, and low-level hardware features, such as thread communication, locality, caching, and scheduling. This motivates techniques and tools for automated concurrency refactoring, or automatic parallelization.

Automatic parallelization simplifies the life of programmers, especially those not extensively exposed to the nightmare of developing efficient concurrent applications, and allows a growing number of (even legacy) systems to benefit from the cheap hardware parallelism available today. Automatic parallelization targets programming transparency: the application is coded as sequential, and the specific methodology used for handling concurrency is hidden from the programmer.

1.2 Current Research Contributions

In this thesis, we bridge the gap between the parallelism of existing multicore architectures and the sequential design of most (including legacy) applications by presenting:

• HydraVM, a virtual machine built by extending the Jikes RVM and modifying its baseline compiler. It reconstructs programs at the bytecode level to enable more threads to execute in parallel. HydraVM is equipped with ByteSTM, a Software Transactional Memory (STM) implementation at the virtual machine level.

• Lerna, a completely automated compiler framework that extracts parallelism from existing code at the intermediate representation level and executes it adaptively and efficiently on commodity multi-processor chips without any programmer intervention. Lerna is based on LLVM, which covers a wide set of programming languages and supports the generation of a native executable as output.

Both implementations require no programmer intervention, are completely automated, and do not need the source code to be exposed, unlike previous approaches to parallelization that require the programmer to identify parallel sections or to provide additional information to guide the parallelization process.

The common technique in our two implementations is the use of Transactional Memory (TM) as an optimistic execution approach for transforming a sequential application into a parallel one “blindly”, meaning without external intervention. TM introduces a simple parallel programming model while achieving performance close to fine-grained locking. With HydraVM, the user benefits from TM seamlessly: they run a sequential program and enjoy a safe parallel execution. Under the hood, HydraVM handles the problems of detecting parallel code snippets, concurrent memory access, synchronization, and resource utilization.

From the experience with HydraVM, we learned about parallel patterns and bottlenecks in programs. First, relying only on dynamic analysis introduces an overhead that can easily be avoided through a pre-execution static analysis phase. Second, adaptive design should not be limited to the architecture (e.g., runtime recompilation); it can be extended to selecting the best number of workers, the assignment of parallel code to threads, and the depth of speculative execution. Last, except for recursive calls, most of the parallel traces detected by HydraVM were primarily loops. Keeping these goals in sight, we designed Lerna as a compiler framework. With Lerna, the sequential code is transformed into a parallel version with best-effort TM support to guarantee safety and preserve ordering. Lerna produces a native executable, yet an adaptive one, thanks to the Lerna runtime library that orchestrates and monitors the parallel execution. The programmer can interact with Lerna to aid the static analysis in order to produce a program with less overhead.

1.2.1 Transactional Parallelization

Transactions were originally proposed for database management systems (DBMS) to guarantee atomicity, consistency, data integrity, and durability of operations manipulating shared data. This synchronization primitive has recently been ported from DBMS to concurrent applications (by relaxing durability), providing a concrete alternative to the manual implementation of synchronization using basic primitives such as locks. This new multi-threading programming model has been named Transactional Memory (TM) [67].

Transactional memory (TM) has emerged as a powerful concurrency control abstraction [59, 79] that permits developers to define critical sections as simple atomic blocks, which are internally managed by the TM itself. It allows the programmer to access shared memory objects without providing mutual exclusion by hand, which inherently overcomes the drawbacks of locks. As a result, with TM the programmer still writes concurrent code using threads, but the code is now organized so that reads and writes to shared memory objects are encapsulated into atomic sections. Each atomic section is a transaction, in which the enclosed reads and writes appear to take effect instantaneously. Transactions execute speculatively, while logging the changes made to objects (e.g., using an undo-log or a write-buffer). When two transactions conflict (e.g., read/write, write/write), one of them is aborted and the other is committed, yielding (the illusion of) atomicity. Aborted transactions are re-started after rolling back their changes, e.g., undoing object changes using the undo-log (eager), or discarding the write-buffer (lazy).

Besides a simple programming model, TM provides performance comparable to highly concurrent, fine-grained locking implementations [43, 22], and allows composability [60], which enables multiple nested atomic blocks to be executed as a single all-or-nothing transaction. Multiprocessor TM has been proposed in hardware (HTM), in software (STM), and in hardware/software combination. TM's adoption has grown in recent years, especially after its integration with the popular GCC compiler (from release 4.7), and due to its integration into the cache-coherence protocol of commodity processors, such as Intel [72] and IBM [24], which natively allows transactions to execute directly on the hardware.

Given its easy-to-use abstraction, TM would seem the missing building block enabling the (transparent) parallelization of sequential code by automatically injecting atomic blocks around parallel sections. Unfortunately, such a high-level abstraction comes with a price in terms of performance cost, which easily leads to parallel code slower than its sequential version (e.g., in [26, 116] a performance degradation of between 3 and 7 times has been shown). Motivated by TM's advantages, we designed and implemented: 1) a Java virtual machine, named HydraVM, that detects eligible parallel hot-spot portions of the code at runtime and reconstructs the program, producing a parallel version of the code, and 2) Lerna, a language-independent compiler that transforms a sequential application into an adaptive parallel version that utilizes the underlying computation resources based on execution profiles. Each of these implementations is aided by novel techniques to overcome TM overhead and to support efficient parallel execution. Additionally, we introduce the general conditions for preserving the chronological ordering of the sequential program and for embedding any transactional memory algorithm into our system.
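As a concrete illustration of the atomic-block abstraction described above, the following minimal sketch uses the transactional memory support integrated into GCC since release 4.7 (compiled with -fgnu-tm). The accounts array and the transfer function are illustrative names introduced here, not code from this work.

    // Minimal sketch of a GCC atomic block (requires -fgnu-tm).
    static long accounts[2] = {100, 100};   // illustrative shared data

    void transfer(int from, int to, long amount) {
        __transaction_atomic {              // enclosed accesses appear to happen atomically
            accounts[from] -= amount;
            accounts[to]   += amount;
        }                                   // conflicting executions abort and re-execute
    }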

1.2.2 HydraVM

HydraVM is a Java virtual machine based on the Jikes RVM. The user runs their sequential program (Java classes) on the virtual machine, and internally we instrument the bytecode at runtime to profile the execution paths. The profiling detects which places in the code are suitable for parallelization and have no data dependencies. Next, we reconstruct the bytecode by extracting portions of the code to run as separate threads. Each thread runs as a transaction, which means it operates on its private copy of memory. Conflicting transactions are aborted and retried, while successful transactions commit their changes at the chronological time of the code they execute.

HydraVM inherits the adaptive design of the Jikes RVM. Profiling and code reconstruction continue while the program is running. We monitor the performance and abort rate of the transformed code and use this information to repeatedly reconstruct new versions of the code. For example, assume a loop is parallelized and each iteration runs in a separate transaction. When every three consecutive iterations of the loop conflict with each other, a better reconstruction is to combine them and make the transaction run three iterations instead of one. It is worth noting that the reconstruction process occurs at runtime by reloading the class definitions.

Finally, we developed ByteSTM, a software transactional memory implementation at the bytecode level. With ByteSTM we have access to low-level memory (e.g., registers, the thread stack), so we can create a memory signature for the memory accessed by the current transaction. Comparing the signatures of concurrent transactions allows us to quickly detect conflicting transactions.
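To make the iteration-combining idea concrete, the sketch below shows one tile of iterations executed inside a single transaction, plus a trivial adaptation rule. This is only an illustration in C-like form: HydraVM operates on Java bytecode inside the JVM, and txBegin, txCommitInOrder, loopBody, and tileSize are hypothetical names, not part of ByteSTM's interface.

    #include <algorithm>

    // Hypothetical hooks; HydraVM's real mechanism lives inside the JVM.
    void txBegin();
    void txCommitInOrder(int chronologicalOrder);  // commit respects the loop order
    void loopBody(int i);

    int tileSize = 1;   // how many consecutive iterations one transaction executes

    // One parallel job: speculatively run a tile of consecutive iterations.
    void runTile(int start, int n) {
        txBegin();
        for (int i = start; i < std::min(n, start + tileSize); ++i)
            loopBody(i);                           // writes are buffered by the STM
        txCommitInOrder(start);
    }

    // Adaptive reconstruction: if consecutive iterations keep conflicting
    // (e.g., every three of them), run them together in one transaction.
    void adaptTileSize(double abortRate) {
        if (abortRate > 0.5)
            tileSize = 3;
    }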

1.2.3 Lerna

Lerna is a compiler that operates at the intermediate representation level, which makes it independent of the source code and the programming language used. The generated code is a task-based multi-threaded version of the input code. Unlike HydraVM, Lerna employs static analysis techniques, such as alias analysis and memory dependency analysis, to reduce the overhead of transactional execution, and here we focus on parallelizing loops. Lerna is not limited to a specific TM implementation, and the integration of any TM algorithm can be done through a well-defined API. Additionally, in Lerna the mapping between extracted tasks and transactions is not one-to-one: a transaction can run multiple tasks (tiling), or a task can be split across multiple transactions (partitioning).

Lerna and HydraVM share the idea of exploiting transactional memory, profiling, and producing adaptive code. However, as Lerna supports the generation of a native executable as output, we cannot rely on an underlying layer (the virtual machine in HydraVM) to support the adaptive execution. Instead, we link the program with the Lerna runtime library, which is able to monitor and modify the key performance parameters of our executor module, such as the number of worker threads, the mapping between transactions and tasks, and whether to execute in transactional mode or to go sequentially; a sketch of these knobs follows.
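The structure below is a hypothetical sketch of those runtime knobs; the names (ExecutorConfig, tune) and the thresholds are illustrative and are not Lerna's actual API.

    // Hypothetical sketch of the parameters the runtime library monitors and adjusts.
    struct ExecutorConfig {
        int  workerThreads      = 4;     // number of worker threads
        int  jobsPerTransaction = 1;     // tiling: one transaction runs several jobs
        int  transactionsPerJob = 1;     // partitioning: one job split into transactions
        bool runSequentially    = false; // fall back to sequential execution
    };

    // Called periodically with statistics collected from the executor module.
    void tune(ExecutorConfig& cfg, double abortRate, double speedupOverSequential) {
        if (speedupOverSequential < 1.0 && abortRate > 0.9)
            cfg.runSequentially = true;        // parallel execution is not paying off
        else if (abortRate > 0.5)
            cfg.jobsPerTransaction *= 2;       // merge jobs that keep conflicting
    }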

1.3 Proposed Post-Prelim Work

From our experience with HydraVM and Lerna, we learned that preserving program order hampers scalability. A thread that completes its execution must either stall, waiting for its correct chronological order in the program, or proceed to execute more transactions, which increases the lifetime of the pending transactions and makes them more likely to conflict with other transactions. Additionally, transactional reads and writes are sandboxed, which prevents any possible cooperation between transactions that could still produce correct program results. For example, in DOACROSS loops [36] iterations are data or control dependent; however, they can run in parallel by exchanging some data between them. We propose a novel technique for cooperation between concurrent transactions that executes them out of order. With this technique, transactions can expose their changes to other transactions without waiting for their chronological commit times. Nevertheless, transactions with an earlier chronological order can abort completed transactions (and cascade the abort to any other affected transactions) whenever a conflict exists.

Aborting a transaction is a costly operation: besides the rollback cost, retrying the code execution doubles the cost and wastes precious processing cycles. Additionally, transactions may do a lot of local processing work that does not touch any protected shared variables. We propose transaction checkpointing as a technique that creates multiple checkpoints at certain execution points. Using checkpoints, a transaction saves the state of the work done after completing a considerable amount of work. Later, if the transaction experiences a conflict, due to an inconsistent read or because it wrote a value that should have been read by another transaction executing code with an earlier chronological order, it can return to the last state in which the execution was valid (i.e., before performing the invalid memory operation). Checkpointing introduces an overhead for creating the checkpoints (e.g., saving the current state) and for restoring the state (upon abort). Unless the work done by the transaction is large enough to outweigh this overhead, this technique is not recommended. Also, checkpointing should be employed only when the abort rate exceeds a predefined threshold. A good example that would make use of this technique is a transaction that finishes its processing by updating a shared data structure with a single point of insertion (e.g., a linked list, queue, or stack): two concurrent transactions will conflict when trying to use this shared data structure. With checkpointing, the aborted transaction can jump back to just before the last valid step and retry only the shared access.
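The sketch below illustrates how such a transaction could use checkpointing around the single shared insertion point. The interface (txBegin, txCheckpoint, txRestoreTo, txCommit) and the helper functions are hypothetical; they only illustrate the intended usage pattern, not an existing STM API.

    // Hypothetical checkpointing interface; none of these names denote an existing STM.
    struct Node;
    void  txBegin();
    int   txCheckpoint();              // save the transaction state, return an id
    void  txRestoreTo(int checkpoint); // partial abort: roll back to that state
    bool  txCommit();                  // false means a conflict was detected
    Node* expensiveLocalWork();        // long computation on thread-private data
    void  sharedListInsert(Node* n);   // single point of contention

    void job() {
        txBegin();
        Node* result = expensiveLocalWork();   // no shared accesses here
        int cp = txCheckpoint();               // checkpoint before touching shared data
        for (;;) {
            sharedListInsert(result);
            if (txCommit()) return;            // committed: all work takes effect
            txRestoreTo(cp);                   // conflict: re-execute only the insertion
        }
    }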

1.4 Outline

This thesis is organized as follows. In Chapter 2, we survey the techniques for solving the parallelization problem. We overview past and related efforts in Chapter 3. In Chapters 4 and 5, we detail the architecture, transformation, optimization techniques, and experimental evaluation of our two implementations: HydraVM and Lerna. Finally, we conclude and present the post-preliminary work in Chapter 6. Chapter 2

Background

Parallel computations can be performed at multiple levels such as instructions, branch targets, loops, execution traces, and subroutines. Loops have gained a lot of interest as they are by nature a major source of parallelism and usually contain most of the processing.

Instruction Level Parallelism (ILP) is a measure of how many instructions can run in parallel. ILP is application specific, as it involves reordering the execution of instructions to run in parallel and utilize the underlying hardware. ILP can be implemented in software (compiler) or in hardware (pipelining). Common techniques for supporting ILP are: out-of-order execution, where instructions execute in any order that does not violate data dependencies; register renaming, which renames instruction operands to avoid unnecessary reuse of registers; and speculative execution, where an instruction is executed before its turn according to the serial control flow.

Since on average 20% of instructions are branches, branch prediction has been extensively studied to optimize branch execution and run targets in parallel with the evaluation of the branching condition. The branch predictor is usually implemented in hardware with different variants. Early implementations of SPARC and MIPS used static prediction: they always predicted that a conditional jump would not be taken, and executed the next instruction in parallel with evaluating the condition. Dynamic prediction is implemented through a state machine that keeps the history of branches (per branch, or globally). For example, the Intel Pentium processor uses a four-state machine to predict branches. The state machine can be local (per branch instruction), as in the Intel Pentium MMX, Pentium II, and Pentium III, or global (a shared history of all conditional jumps), as in AMD processors and the Intel Core and Core 2 families.

Loops can be classified as sequential loops, parallel loops (DOALL), and loops of intermediate parallelism (DOACROSS) [36]; see Figure 2.1. In DOALL loops, iterations are independent and can run in parallel since no data dependency exists between them. DOACROSS loops exhibit inter-iteration data dependencies.


(a) Sequential loop:
    for i = 1 to n {
        Sum = Sum + A[i];
    }

(b) DOALL loop:
    for i = 1 to n {
        D[i] = A[i] * B[i] + c;
    }

(c) Lexically-forward DOACROSS loop:
    for i = 1 to n-1 {
        D[i] = A[i] + B[i];
        C[i] = D[i+1] * B[i];
    }

(d) Lexically-backward DOACROSS loop:
    for i = 2 to n {
        D[i] = A[i] + B[i];
        C[i] = D[i-1] * B[i];
    }

Figure 2.1: Loop classification according to inter-iteration dependency

When data produced by lower-index iterations is used by higher-index iterations (lexically-backward), the processors executing the higher-index iterations must wait for the required values from the lower-index iterations to be computed (see Figure 2.1d). In contrast, in lexically-forward loops, lower-index iterations access data used at higher indices; iterations can run in parallel without delay if both load their data at the beginning of the iteration.

Traces are defined as the hot paths in program execution. Traces can be parts of loops or span multiple iterations, and can be parts of individual methods or span multiple methods. Traces are detected dynamically at runtime, and a trace is defined as a set of basic blocks (a set of instructions ending with a branch). Traces can run in parallel on multiple threads and are then linked together according to their entry and exit points. Figure 2.2 shows an example of traces that run in parallel on three processors.

In contrast to data parallelism (e.g., a loop operating on the same data as in Figure 2.1), task parallelism is a form of parallelization wherein tasks or subroutines are distributed over multiple processors. Each task has a different control flow and processes a different set of data; however, tasks can communicate with each other by passing data between threads. Running different tasks in parallel reduces the overall runtime of the program.

Another classification of parallel techniques is according to user intervention. In manual parallelization, the programmer uses programming language constructs to define parallel portions of the code and protects shared data through synchronization primitives. An alternative approach is that the programmer designs and develops the program to run sequentially, and uses interactive tools that analyze the program (statically or dynamically) and provide the programmer with hints about data dependencies and portions eligible for parallelization. Iteratively, the programmer modifies the code and compares the performance scaling of different threading designs to get the best possible speedup. Clear examples of this semi-automated technique are Intel Parallel Advisor [1] and the Paralax compiler [129]. Lastly, automatic parallelization aims to parallelize the program without any user intervention.


Figure 2.2: Running traces over three processors

2.1 Manual Parallelization

Designing a parallel program that runs efficiently in a multi-processor environment is not a trivial procedure, as it involves multiple factors and requires a skilled programmer. Maintenance and debugging of a parallel program are a nightmare and incredibly difficult, as they involve races, invalid shared data accesses, livelock and deadlock situations, and non-deterministic scenarios.

First, the programmer must understand the problem to be solved and the nature of the input data. A problem can be partitioned according to its domain (i.e., input data decomposition) or through functional decomposition (cooperative parallel sub-tasks). After partitioning the problem, there is usually some kind of communication and shared data access between the different tasks. Both communication and shared access present a challenge (and usually a delay) to the parallel program. An application is embarrassingly parallel when the communications between its sub-tasks are rare and do not require a lot of data sharing. The following factors need to be considered for inter-task communication:

1. Frequency: an application exhibits fine-grained parallelism if its sub-tasks communicate frequently, while in coarse-grained parallelism the application experiences less communication. As communication takes place over a communication medium (e.g., a bus), contention over this medium directly affects the overall performance, especially if the medium is also being used for other purposes (e.g., transferring data from and to processors).

#pragma omp for private(n) shared(total) ordered schedule(dynamic)
for (int n = 0; n < 100; ++n) {
    files[n].compress();
    total += get_size(files[n]);
    #pragma omp ordered
    send(files[n]);
}

Figure 2.3: OpenMP code snippet

2. Cost: the communication cost is determined by the delay in sending and receiving the information and by the amount of transferred data.

3. Blocking: communication can be either synchronous or asynchronous. Synchronous communication blocks the receiver (and the sender as well when an acknowledgment is required). On the other hand, asynchronous communication allows both sides to proceed with processing, but introduces more complexity into the application design.

Protecting shared data from concurrent access requires a synchronization mechanism such as barriers and locks. Synchronization primitives usually introduce a bottleneck and a significant overhead to the processing time. Besides, a misuse of locks can lead the application to deadlock (i.e., two or more tasks are each waiting for the other to finish), livelock (i.e., tasks are not blocked, but stay busy responding to each other instead of resuming work), or starvation (i.e., a greedy set of tasks keeps holding the locks for a long time).

Resolving data and control dependencies between tasks is the responsibility of the programmer. A good understanding of the inputs and the underlying algorithm helps in determining independent tasks; there are also tools that can help in this step [1]. Finally, the programmer should maintain load balance between the computation resources. For example, static assignment of tasks to processors (e.g., round-robin) is a lightweight technique but may lead to low utilization, while dynamic assignment of tasks requires monitoring the state of tasks (i.e., their start and end times) and handling a shared queue of ready-to-execute tasks.

As an example of parallel programming languages, OpenMP is a compiler extension for the C, C++, and Fortran languages that supports adding parallelism to existing code without significant changes. OpenMP primarily focuses on parallelizing loops and running iterations in parallel, with optional ordering capabilities. Figure 2.3 shows an example of OpenMP parallel code that compresses a hundred files, sends them in order, and calculates the total size after compression. The programmer can define shared variables (e.g., the total variable) and thread-local variables (e.g., the loop counter n). Shared variables are transparently protected from concurrent access. A private copy is created for variables that are defined as thread-local (i.e., using the private primitive).

2.2 Automatic Parallelization

Past efforts on parallelizing sequential programs can be broadly classified into speculative and non-speculative techniques. Non-speculative techniques, which are usually compiler-based, exploit loop-level parallelism and differ in the type of data dependency that they handle (e.g., static arrays, dynamically allocated arrays, pointers) [17, 55, 113, 40]. In speculative techniques, parallel sections of the code run speculatively, guarded by a compensating mechanism for handling operations that violate the application's consistency. The key idea is to provide more concurrency where extra computing resources are available. Speculative techniques can be broadly classified based on:

1. what program constructs they use to extract threads (e.g., loops, subroutines, traces, branch targets),

2. whether they are implemented in hardware or software,

3. whether they require source code, and

4. whether they are done online, offline or both.

Of course, this classification is not mutually exclusive. The execution of speculative code is either eager or predictive. In eager speculative execution, every path of the code is executed (under the assumption of unlimited resources), but only the correct value is committed. With predictive execution, selected paths are executed according to a prediction heuristic. If there is a misprediction, the execution is unrolled and re-executed. Among speculative techniques, two primary approaches are appealing: thread-level speculation (TLS) and transactional memory (TM).

2.3 Thread Level Speculation

Thread-Level Speculation (TLS) refers to the execution of out-of-order (unsafe) operations while caching their results in thread-local storage or a buffer (usually using the processor cache). TLS assigns an age to each thread according to how early the code it executes appears in the program. The thread with the earliest age is marked as safe.


Figure 2.4: Example of thread-level speculation over four processors

A speculative thread is subject to overflowing its buffer. When a thread's buffer is full, the thread either stalls (until it becomes the lowest age) or is squashed and restarted. An exception is the safe thread (the one executing the earliest code). TLS monitors dependency violations (e.g., by checking cache line requests). For example, a write-after-read (WAR) dependency violation occurs when a higher-age speculative thread modifies a value before a lower-age thread reads it. Similarly, when a higher-age speculative thread reads an address and a lower-age thread later changes the value at this address, this is another dependency violation, named read-after-write (RAW). A different type of violation is the control dependency violation, in which a speculative thread executes code that is unreachable due to a change in the program control flow. Any dependency violation (data or control) causes the higher-age thread to be squashed and restarted. Figure 2.4 shows an example of TLS using four processors.

To summarize, TLS runs code speculatively, and eventually the correct order is determined. The evaluated operations are checked for correctness; incorrect results are discarded and the thread is restarted. TLS is usually implemented in hardware. Speculative threads use the processor cache as a buffer for the thread's memory changes, and cache coherence protocols are overloaded with the TLS monitoring algorithm. Squashing a thread is done by aborting the thread and discarding the cache contents, while committing a thread's changes is done by flushing the cache to the main memory. The same concept can be applied in software [78, 113, 40] (with performance issues), especially when low-level operations are not feasible [34, 29] (e.g., with Java-based frameworks). Software TLS implementations use data access signatures (or summaries) to detect conflicts between speculative threads. An access signature represents all addresses accessed by the current thread, and a conflict is detected by intersecting signatures and looking for partial matches.
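As a rough illustration of such signatures (a sketch under our own naming, not any specific TLS system's implementation), each thread can hash the addresses it touches into a small bit vector; a non-empty intersection of two vectors signals a possible conflict, with false positives but no false negatives.

    #include <cstdint>
    #include <functional>

    // Sketch of a per-thread access signature: every accessed address sets one
    // bit of a fixed-size vector; intersecting two signatures over-approximates
    // the set of real conflicts (false positives are possible).
    struct Signature {
        std::uint64_t bits = 0;

        void add(const void* addr) {
            std::size_t h = std::hash<const void*>{}(addr);
            bits |= 1ULL << (h % 64);
        }
        bool intersects(const Signature& other) const {
            return (bits & other.bits) != 0;   // any common bit => possible conflict
        }
        void clear() { bits = 0; }             // reset on squash or commit
    };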

2.4 Transactional Memory

Lock-based synchronization is inherently error-prone. Coarse-grained locking, in which a large data structure is protected using a single lock, is simple and easy to use but permits little concurrency. In contrast, with fine-grained locking [89, 69], in which each component of a data structure (e.g., a bucket of a hash table) is protected by a lock, programmers must acquire the necessary and sufficient locks to obtain maximum concurrency without compromising safety. Both these situations are highly prone to programmer errors. In addition, lock-based code is non-composable. For example, atomically moving an element from one hash table to another using those tables' (lock-based) atomic methods is difficult: if the methods internally use locks, a thread cannot simultaneously acquire and hold the locks of the two tables' methods; if the methods were to export their locks, that would compromise safety. Furthermore, lock-inherent problems such as deadlocks, livelocks, lock convoying, and priority inversion have not gone away. For these reasons, lock-based concurrent code is difficult to reason about, program, and maintain [64].

Transactional memory (TM) [59] is a promising alternative to lock-based concurrency control. It was proposed as an alternative model for accessing shared memory addresses, without exposing locks in the programming interface, to avoid the drawbacks of locks. With TM, programmers write concurrent code using threads but organize code that reads/writes shared memory addresses into atomic sections. Atomic sections are defined as transactions in which reads and writes to shared addresses appear to take effect instantaneously. A transaction maintains its read-set and write-set and, at commit time, checks for conflicts on shared addresses. Two transactions conflict if they access the same address and one access is a write. When that happens, a contention manager [118] resolves the conflict by aborting one and allowing the other to proceed to commit, yielding (the illusion of) atomicity. Aborted transactions are re-started, often immediately. Thus, a transaction ends either by committing (i.e., its operations take effect) or by aborting (i.e., its operations have no effect). In addition to a simple programming model, TM provides performance comparable to highly concurrent, fine-grained locking implementations [43, 22] and is composable [60]. Multiprocessor TM has been proposed in hardware (HTM), in software (STM), and in hardware/software combination (HyTM).

Transactional memory algorithms differ according to their design choices. An algorithm can apply its changes directly to memory and maintain an undo-log for restoring the memory to a consistent state upon failure. Such an optimistic approach fits situations with rare conflicts. In contrast, an algorithm can use a private thread-local buffer to keep its changes invisible to other transactions; at commit, the local changes are merged with the main memory. Another orthogonal design choice is the time of conflict detection: transactions may acquire access (a lock) to their write-set at encounter time or during the commit phase.

Figure 2.5 shows an example of transactional code. Atomic sections are executed as transactions. Thus, the possible values of A and B are either 42 and 42, or 22 and 11, respectively.

A = 10, B = 20;

Thread A:
    atomic {
        B = B + 1;
        A = B * 2;
    }
    ...

Thread B:
    atomic {
        B = A;
    }
    ...

Figure 2.5: An example of transactional code using atomic TM constructs

An inconsistent view of memory (e.g., A = 20 and B = 10), due to an atomicity violation or interleaved execution, causes one of the transactions to abort, roll back, and then re-execute. Motivated by TM's advantages, several recent efforts have exploited TM for automatic parallelization. In particular, trace-based automatic/semi-automatic parallelization is explored in [20, 21, 27, 42], which use HTM to handle dependencies. [104] parallelizes loops with dependencies using thread pipelines, wherein multiple parallel thread pipelines run concurrently. [87] parallelizes loops by running them as transactions, with STM preserving the program order. [122] parallelizes loops by running a non-speculative “lead” thread, while other threads run other iterations speculatively, with STM managing dependencies.

2.4.1 NOrec

The No Ownership Records algorithm [38], or NOrec, is a lazy software transactional memory algorithm that uses a minimal amount of metadata for accessed memory addresses. Unlike other STM algorithms, NOrec does not associate ownership records (orecs) with its write-set. Instead, it uses value-based validation at commit time to make sure that the values it read are still the values the transaction used during its execution. Transactions use a write-buffer to store their updates; the write-buffer is implemented as a linear-probed hash table with versioned buckets to support O(1) clearing (when a transaction descriptor is reused). NOrec uses a lazy locking mechanism, which means that written addresses are not locked until commit time. This approach reduces the locking time for the accessed memory locations, which allows readers to proceed without interference from writers.

NOrec employs a single global sequence lock. Whenever a transaction commits, the global lock is acquired and incremented. This design limits the system to a single committing transaction at a time. The commit procedure starts by atomically incrementing the sequence lock. Failing to increment the lock means another writer transaction is trying to commit, so the current transaction needs to validate its read-set and wait for the other transaction to finish its commit. Upon a successful increment, the transaction acquires locks on its write-set, exposes the changes to memory, and releases the locks. Another drawback of the algorithm is that, before each read, the whole read-set may need to be validated. This is required to maintain opacity [53], a TM property which mandates that even invalid transactions always read a consistent view of memory. NOrec avoids unneeded validation by validating only when the global sequence number has changed (i.e., a transaction commit has taken place). NOrec is the default TM algorithm for Lerna. In Chapter 5 we describe our variant of the algorithm that preserves the ordering between concurrent transactions.
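A compact sketch of the algorithm's core ideas is shown below (single global sequence lock, lazy write-buffer, value-based validation). It is a simplified illustration, not the published implementation: the O(1)-clearable write-buffer, contention management, and several opacity details are omitted, and the type and helper names are ours.

    #include <atomic>
    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    std::atomic<std::uint64_t> globalSeq{0};   // even = idle, odd = a commit in progress
    struct TxAbort {};                         // thrown to restart the transaction

    struct Tx {
        std::uint64_t snapshot = 0;
        std::vector<std::pair<long*, long>> readSet;   // (address, observed value)
        std::unordered_map<long*, long> writeBuf;      // lazy write-buffer

        void begin() { do { snapshot = globalSeq.load(); } while (snapshot & 1); }

        bool validate() const {                // value-based: re-check observed values
            for (auto& [a, v] : readSet) if (*a != v) return false;
            return true;
        }

        long read(long* addr) {
            if (auto it = writeBuf.find(addr); it != writeBuf.end()) return it->second;
            long v = *addr;
            while (globalSeq.load() != snapshot) {          // someone committed meanwhile
                do { snapshot = globalSeq.load(); } while (snapshot & 1);
                if (!validate()) throw TxAbort{};           // stale reads: restart
                v = *addr;
            }
            readSet.emplace_back(addr, v);
            return v;
        }

        void write(long* addr, long v) { writeBuf[addr] = v; }

        bool commit() {
            if (writeBuf.empty()) return true;              // read-only commits for free
            while (!globalSeq.compare_exchange_strong(snapshot, snapshot + 1)) {
                do { snapshot = globalSeq.load(); } while (snapshot & 1);
                if (!validate()) return false;              // abort and retry the transaction
            }
            for (auto& [a, v] : writeBuf) *a = v;           // expose the write-buffer
            globalSeq.store(snapshot + 2);                  // release the sequence lock
            return true;
        }
    };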

2.4.2 TinySTM

TinySTM [50] is a lock-based, lightweight STM algorithm. It uses a single-version, word-based variant of the LSA algorithm [108]. Similar to other word-based locking algorithms [38, 43], TinySTM relies on a shared array of locks that covers all memory addresses. Multiple addresses are covered by the same lock; each lock uses a single bit as a state (i.e., locked or not) and the remaining bits as a version number. This version number indicates the timestamp of the last transaction that wrote to any of the addresses covered by this lock. As a lock covers a portion of the address space, false conflicts can occur when concurrent transactions access adjacent addresses.

Unlike NOrec, TinySTM uses encounter-time locking (ETL): a transaction acquires locks during its execution. The benefit of ETL is twofold: conflicts are discovered early, which avoids wasting processing time executing a doomed transaction, and it simplifies the handling of read-after-write situations. The algorithm uses a time-based design [43, 108] by employing a shared counter as a clock. Update transactions acquire a new timestamp at commit, validate their read-set, and then store the timestamp into the versioned locks of their write-set.

TinySTM is proposed with two strategies for memory access, write-through and write-back, each with its advantages and limitations. Using the write-back strategy, updates are kept in a local transaction write-buffer until commit time, while in write-through, updates go directly to the main memory and the old values are stored in an undo-log. With write-through, a transaction has a lower commit-time overhead and faster read-after-write/write-after-write handling; however, an abort is costly as it requires restoring the old values of the written addresses. On the other hand, in write-back the abort procedure simply discards the read and write sets, but commit requires validating the read-set and moving the write-set values from the local write-buffer to the main memory.

The use of ETL is interesting for our work, as it enables early detection of conflicting transactions, which saves processing cycles. Additionally, the write-through strategy is a perfect fit for low-contention workloads, as it involves a lightweight commit that can lead to performance comparable to the sequential execution.
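The following sketch shows the kind of versioned lock array the paragraph above describes: addresses are hashed onto a fixed table of lock words whose low bit is the lock flag and whose remaining bits hold the writer's commit timestamp. It is an illustrative sketch under our own naming, not TinySTM's source code.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // Fixed table of versioned lock words covering the whole address space.
    constexpr std::size_t kLocks = 1 << 20;
    std::atomic<std::uint64_t> lockTable[kLocks];
    std::atomic<std::uint64_t> globalClock{0};   // time-based design: commit timestamps

    inline std::atomic<std::uint64_t>& lockFor(const void* addr) {
        // several addresses map to one entry, so false conflicts are possible
        return lockTable[(reinterpret_cast<std::uintptr_t>(addr) >> 3) % kLocks];
    }

    inline bool isLocked(std::uint64_t word)      { return word & 1; }
    inline std::uint64_t versionOf(std::uint64_t word) { return word >> 1; }

    // Encounter-time locking (ETL): a writer tries to own the entry immediately,
    // so a doomed transaction is detected early instead of at commit time.
    inline bool tryAcquire(std::atomic<std::uint64_t>& lock, std::uint64_t& observedVersion) {
        std::uint64_t w = lock.load();
        if (isLocked(w)) return false;            // owned by a concurrent writer
        observedVersion = versionOf(w);
        return lock.compare_exchange_strong(w, w | 1);
    }

    inline void releaseWithTimestamp(std::atomic<std::uint64_t>& lock, std::uint64_t commitTime) {
        lock.store(commitTime << 1);              // unlock and publish the new version
    }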

Figure 2.6: Maximum speedup according to Amdahl's law (a) and Gustafson's law (b), for serial fractions S = 75%, 50%, 25%, and 10%, with up to 20 threads

2.5 Parallelism Limits and Costs

Amdahl's law is a formula that gives the upper bound on the expected speedup of an overall system when only part of the system is improved. Let S be the fraction of the program that is serial (i.e., not included in the parallelism). When executing the program using n threads, the expected execution time is

Time(n) = Time(1) × (S + (1 − S)/n)

Therefore, the maximum speedup is given by the following formula

Speedup(n) = 1 / (S + (1 − S)/n) = n / (1 + S × (n − 1))

Figure 2.6a shows the maximum possible speedup for different percentages of the sequential portion of the code with different numbers of threads. With only 10% of the code being sequential, the maximum speedup using 20 threads is only around 7×. Thus, at some point adding an additional processor to the system will add less speedup than the previous one, as the total speedup approaches the limit of 1/S. However, this assumes a fixed input size. Usually, adding more processors permits solving larger problems (i.e., increasing the input size), which consequently increases the parallel portion of the program. This fact is captured by Gustafson's law, which states that computations involving arbitrarily large data sets can be efficiently parallelized. Accordingly, the speedup can be defined by the following formula (see Figure 2.6b):

Speedup(n) = n − S × (n − 1)
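As a quick numerical check of the two formulas (a small helper added here for illustration only, not part of the proposal), evaluating both bounds for S = 10% and n = 20 reproduces the roughly 7× Amdahl limit mentioned above versus an 18.1× Gustafson bound.

    #include <cstdio>

    // S is the serial fraction of the program, n the number of threads.
    double amdahlSpeedup(double S, int n)    { return n / (1.0 + S * (n - 1)); }
    double gustafsonSpeedup(double S, int n) { return n - S * (n - 1); }

    int main() {
        // S = 10%, n = 20: prints ~6.9 (Amdahl) and 18.1 (Gustafson)
        std::printf("%.1f %.1f\n", amdahlSpeedup(0.10, 20), gustafsonSpeedup(0.10, 20));
        return 0;
    }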

In general, parallel applications are much more complex than sequential ones. The complexity appears in every aspect of the development cycle, including design, development, debugging, tuning, and maintenance, in addition to the hardware requirements for running a parallel system. The return on each added processor is not guaranteed to be reflected in the overall performance. On the contrary, sometimes splitting the workload over even more threads increases, rather than decreases, the amount of time required to finish. This is known as parallel slowdown. Chapter 3

Past & Related Work

3.1 Transactional Memory

The classical solution for handling shared memory during concurrent access is lock-based techniques [7, 71], where locks are used to protect shared addresses. Locks have many drawbacks, including deadlocks, livelocks, lock convoying, priority inversion, non-composability, and the overhead of lock management. TM, proposed by Herlihy and Moss [67], is an alternative approach for shared memory access with a simpler programming model. Memory transactions are similar to database transactions: a memory transaction is a self-maintained entity that guarantees atomicity (all or none), isolation (local changes are hidden until commit), and consistency (linearizable execution). TM has gained significant research interest, including work on STM [119, 90, 61, 58, 65, 66, 86], HTM [67, 57, 6, 18, 91], and HyTM [12, 39, 92, 77]. STM has a relatively larger overhead due to transaction management and architecture independence. HTM has the lowest overhead but assumes architecture specializations. HyTM seeks to combine the best of HTM and STM.

STM can be broadly classified as static or dynamic. In static STM [119], all accessed addresses are defined in advance, while dynamic STM [65, 66, 86] relaxes that restriction. The dominant trend in STM designs is to implement the single-writer/multiple-reader pattern, either using locks [44, 43] or obstruction-free techniques [109, 65] (i.e., a single thread executing in isolation will complete its operation within a bounded number of steps), while a few implementations allow multiple writers to proceed under certain conditions [110]. In fact, it is shown in [48] that obstruction-freedom is not an important property and results in less efficient STM implementations than lock-based ones.

Another orthogonal TM property is the address acquisition time: pessimistic approaches acquire addresses at encounter time [48, 22], while optimistic approaches do so at commit time [44, 43].


Optimistic address acquisition generally provides better concurrency with an acceptable number of conflicts [43]. STM implementations also rely on a write-buffer [85, 86, 65] or an undo-log [92] for ensuring a consistent view of memory. With a write-buffer, address modifications are written to a local buffer and take effect at commit time. With the undo-log method, writes directly change the memory, and the old values are kept in a separate log to be restored upon abort.

Nesting (or composability) is an important feature for transactions, as it allows partial rollback and introduces semantics between the parent transaction and its enclosed ones. Earlier TM implementations did not support nesting or simply flattened nested transactions into a single top-level transaction. Harris et al. [60] argue that closed nested transactions, supporting partial rollback, are important for implementing composable transactions, and presented an orElse construct that relies upon closed nesting. In [4], Adl-Tabatabai et al. presented an STM that provides both nested atomic regions and orElse, and introduced the notion of mementos to support efficient partial rollback. Recently, a number of researchers have proposed the use of open nesting. Moss described the use of open nesting to implement highly concurrent data structures in a transactional setting [93]. In contrast to the database setting, the different levels of nesting are not well-defined; thus, different levels may conflict. For example, a parent and child transaction may both access the same memory location and conflict.

3.2 Parallelization

Parallelization has been widely explored with different levels of programmer intervention. Manual parallelization techniques rely on the programmer for the analysis and design phases but assist in the implementation and performance tuning. Open Multi-Processing (OpenMP) [37] defines an API that supports shared-memory multiprocessing programming in C, C++, and Fortran. The programmer uses OpenMP directives to define parallel sections of the code and their execution semantics (e.g., ordering). Synchronization of shared access is handled through directives that define the scope of access (i.e., shared or private). OpenMP transparently and efficiently handles low-level operations such as locking, scheduling, and thread stalling. MPI [120], the Message Passing Interface, is a message-based, language-independent communications protocol used for programming parallel computers. MPI offers basic concepts for messaging between parallel processes such as: grouping and partitioning of processes, the type of communication (i.e., point-to-point, broadcast, or reduce), the interchanged data types, and synchronization (global, pairwise, and remote locks). NVIDIA introduced CUDA [100] as a parallel programming and computing platform for its graphics processing units (GPUs); it enables programmers to access the GPU virtual instruction set and memory and to use them for general-purpose processing (not exclusively graphics computation). GPUs rely on many concurrent slow threads rather than a limited count of high-speed cores (as in CPUs), which makes them suitable for data-parallel computations.

An alternative approach, semi-automatic parallelization, is to provide the programmer with hints and design decisions that help them write code more amenable to parallelization, or to detect data dependencies and shared accesses during the design phase, when they are less expensive to fix. Paralax [129] is a compiler framework for extracting parallel code using static analysis; programmers use annotations to assist the compiler in finding parallel sections. DiscoPoP [81] and Kremlin [51] are runtime tools that discover and present ranked suggestions for parallelization opportunities. Intel Parallel Advisor [1] (or Advisor XE) is a shared-memory threading design and prototyping tool. Advisor XE helps programmers detect parallel sections and compare the performance of different threading models. Additionally, it finds data dependencies, which enables eliminating them, if possible.

Automatic parallelization aims to produce best-effort performance without any programmer intervention; this relieves programmers from designing and writing complex parallel applications and is highly useful for legacy code. Polaris [49], one of the earliest parallelizing compiler implementations, proposes a source-to-source transformation that generates a parallel version of the input (Fortran) program. Another example is the Stanford SUIF compiler [55]; it determines parallelizable loops using a set of data dependency techniques such as data dependence analysis, scalar privatization analysis, and reduction recognition. SUIF optimizes cache usage by ensuring that processors re-use the same data and by defragmenting shared addresses.

Parallelization techniques can also be classified according to their granularity. Significant efforts have focused on enhancing fine-grained parallelism, such as the parallelization of nested loops [19, 107, 36], regular/irregular array accesses [114], and scientific computations [134] (e.g., dense/sparse matrix calculations). Sohi et al. [121] increase instruction-level parallelism by dividing tasks for speculative execution on functional units; [112] does so with a trace-based processor. Nikolov et al. [99] use a hybrid processor/SoC architecture to exploit nested loop parallelism, while [5] uses postdominance information to partition tasks on a multithreaded architecture. However, fine-grained parallelism is insufficient to exploit CMP parallelism. Coarse-grained parallelism focuses on parallelizing tasks as a unit of work. In [126], the programmer manually annotates concurrent and synchronized code blocks in C programs and then uses those annotations for runtime parallelization. Gupta et al. [54] and Rugina et al. [113] perform compile-time analysis to exploit parallelism in array-based, divide-and-conquer programs.

3.3 Optimistic Concurrency

Optimistic concurrency techniques, such as thread-level speculation (TLS) and Transactional Memory (TM), have been proposed as a way of extracting parallelism from legacy code. Both techniques split an application into sections using hardware or a compiler, and run them speculatively on concurrent threads. A thread may buffer its state or expose it and use a compensating procedure. At some point, the executed code may become safe and the code proceeds as if it was executed sequentially. Otherwise, the code’s changes are

reverted, and the execution is restarted. Some efforts combined TLS and TM through a unified model [11, 106, 105] to get the best of the two techniques. Parallelization using thread-level speculation (TLS) has been extensively studied using both hardware [124, 75, 56, 32] and software [107, 82, 107, 30, 88]. It was originally proposed by Rauchwerger et al. [107] for identifying and parallelizing loops with independent data access – primarily arrays. The common characteristics of TLS implementations are: they largely focus on loops as the unit of parallelization; they mostly rely on hardware support or changes to the cache coherence protocols; and the size of parallel sections is usually small (e.g., the inner-most loop). In [88], the authors target the same sweet spot we aim for – applications with non-partitionable accesses and data sharing – by providing a low-overhead STM algorithm.

3.3.1 Thread-Level Speculation

Automatic parallelization for thread-level speculation (TLS) hardware has been extensively studied, and most of this work largely focuses on loops [107, 127, 46, 83, 123, 31, 33, 103, 131]. Loop parallelization using TLS has been proposed in both hardware [102] and software [107, 13]. The LRPD Test [107] determines if a loop has any cross-iteration dependencies (i.e., is a DOALL loop) and runs it speculatively; at runtime, a validation step is performed to check if the accessed data is affected by any unpredictable control flow. Saltz and Mirchandane [117] parallelize DOACROSS loops by assigning iterations to processors in a wrapped manner. To prevent data dependency violations, processors have to stall till the correct values are produced. Polychronopoulos [101] proposes running the maximal set of contiguous iterations with no dependency concurrently; consequently, this method does not fully utilize all processors. An alternative approach is to remove dependencies between the iterations. Krothapalli et al. [76] propose a runtime method to remove anti (write-after-read, WAR) and output (write-after-write, WAW) dependencies. This method helps to remove dependencies caused by reusing memory/registers, but does not remove computation dependencies. Prabhu et al. [102] present some guidelines for programmers to manually parallelize code using TLS. In their work, a source-to-source transformation is performed to run the loops speculatively, and TLS hardware detects data dependency violations and acts accordingly (i.e., squashes the threads and restarts the iteration). Zilles et al. [133] introduce an execution paradigm called Master/Slave Speculative Parallelization (MSSP). In MSSP, a master processor executes an approximate version of the program, based on common-case execution, to compute selected values that the full program's execution is expected to compute. The master's results are checked by slave processors that execute the original program. Automatic and semi-automatic parallelization without TLS hardware have also been explored [78, 113, 40, 34, 29]. In [104], Raman et al. propose a software approach that generalizes existing software TLS memory systems to support speculative pipelining schemes, and efficiently tunes it for loop parallelization. Jade [78] is a programming language that supports coarse-grain concurrency and exploits a software TLS technique. Using Jade,

programmers augment the code with data dependency information, and the compiler uses this to determine which operations can be executed concurrently. Deutsch [40] analyzes symbolic access paths for interprocedural may-alias analysis, toward exploiting parallelism. In [34], Choi et al. present escape analysis for Java programs for determining object lifetimes, toward enhancing concurrency.

3.3.2 Parallelization using Transactional Memory

Tobias et al. [47] propose an epoch-based speculative execution of parallel traces using hardware transactional memory (HTM). Parallel sections are identified at runtime based on the binary code. The conservative nature of the design does not utilize all cores; moreover, relying only on runtime information for parallelization introduces non-negligible overhead to the framework. Similarly, DeVuyst et al. [41] use HTM to optimistically run parallel sections, which are detected using special hardware. Sambamba [125] showed that static optimization at compile-time does not exploit all possible parallelism. Similar to our work, it used Software Transactional Memory (STM) with an adaptive runtime approach for executing parallel sections; however, it relies on user input for defining parallel sections. Gonzalez et al. [52] proposed a user API for defining parallel sections and the ordering semantics. Based on user input, STM is used to handle concurrent sections. In contrast, HydraVM does not require special hardware and is fully automated, with an optional user interaction for improving speedup. The study in [130] classified applications into sequential, optimistically parallel, or truly parallel, and classified tasks into ordered (speculative iterations of a loop) and unordered (critical sections). It introduces a model that captures data and inter-dependencies. For a set of benchmarks [25, 9, 98, 28], the study showed important features of each, such as the size of read and write sets, the density of dependencies, and the size of parallel sections. Mehrara et al. [87] present STMlite: a lightweight STM implementation customized to facilitate profile-guided automatic loop parallelization. In this scheme, loop iterations are divided into chunks and distributed over available cores. An outer loop is generated around the original loop body to manage parallel execution between different chunks. In a multi-threaded transaction (MTX), a transaction runs under more than one thread. Vachharajani et al. [128] present a hardware memory system that supports MTXs. Their system changes the hardware cache coherence protocol to buffer speculative states and recover from mis-speculation. Software Multi-threaded Transactions (SMTX) are used to handle memory accesses of speculated threads. Both SMTX and STMlite use a centralized transaction commit manager and conflict detection that is decoupled from the main execution.

3.4 Comparison with existing work

Most of the methodologies, tools, and languages for parallelizing programs target scientific and data-parallel computation-intensive applications, where the actual data sharing is very limited and the data-set is precisely analyzed by the compiler and partitioned so that parallel computation is possible. Examples of those approaches include [97, 111, 82, 68]. Despite the flexibility and ease of use of TM, the primary drawback of using it for parallelization is the significant overhead of code execution and validation [26, 116]. Previous work [52, 125] relied primarily on the programmer for specifying parallel sections and defining ordering semantics. The key weaknesses of relying on the programmer are:

1. it requires a full understanding of the software (e.g., the implementation algorithm and the input characteristics) and mandates the existence of the source code;

2. it does not take into account TM characteristics and factors of overhead; and

3. it uses TM as a black-box utility for preserving data integrity.

In contrast, the solutions proposed in this thesis target common programs and analyze the code at its intermediate representation, without the need for its original source code. This makes us independent of the language in which the code was written, and does not force the user to expose the source code to our framework, which may not even be available in some cases. Additionally, in Lerna, we employ alias analysis and propose two novel techniques, irrevocable transactions and transactional increment, for reducing the read-set size and the validation overhead, and for eliminating some causes of transaction conflicts. As far as we know, this is the first study that attempts to reduce transactional overhead based on program characteristics. HydraVM and MSSP [133] share the same concept of using execution traces (not just loops, as in [87, 122]). However, MSSP uses superblock [70] executions for validating the main execution. In contrast, HydraVM splits execution equally over all threads and uses STM for handling concurrent memory accesses. Perhaps the closest to our proposed work are [21] and [47]. Our work differs from [21] and [47] in the following ways. First, unlike [47], we propose STM for concurrency control, which does not need any hardware transactional support. Second, [21] is restricted to recursive programs, whereas we allow arbitrary programs. Third, [21] does not automatically infer transactions; rather, the entire work performed in tasks (of traces) is packaged as transactions. In contrast, we propose compile- and run-time program analysis techniques that identify traces, which are executed as transactions.

Our work is fundamentally different from past STM-based parallelization works in that we benefit from static analysis for reducing transactional overheads, and we automatically identify parallel sections (i.e., traces or loops) using compile- and run-time program analysis techniques, which are then executed as transactions. Additionally, our work targets arbitrary programs (not just recursive ones, as in [21]), is entirely software-based (unlike [21, 47, 41, 128]), and does not require the program source code. Thus, our proposed work on STM-based parallelization (with the consequent advantages of STM's concurrency control, a completely software-based design, and no need for program sources) has never been done before.

Chapter 4

HydraVM

In this chapter, we present a virtual machine, called HydraVM, that automatically extracts parallelism from legacy sequential code (at the bytecode level) through a set of techniques including online and offline code profiling, data dependency analysis, and execution analysis. HydraVM is built by extending the Jikes RVM [8] and modifying its baseline compiler, and it exploits software transactional memory to manage concurrent and out-of-order memory accesses. HydraVM targets extracting parallel jobs in the form of code traces. This approach is different from loop parallelization [19], because a trace is equivalent to an execution path, which can be a portion of a loop or span loops and method calls. Traces were introduced in [10] as part of HP's Dynamo optimizer, which optimizes native program binaries at runtime using a trace cache. To handle potential memory conflicts, we develop ByteSTM, which is a VM-level STM implementation. In order to preserve the original semantics, ByteSTM suspends completed transactions till their valid commit times are reached. Aborted transactions discard their changes and are either terminated (i.e., on a program flow violation or a misprediction) or re-executed (i.e., to resolve a data-dependency conflict). We experimentally evaluated HydraVM on a set of benchmark applications, including a subset of the JOlden benchmark suite [23]. Our results reveal speedup of up to 5× over the sequential code.


4.1 Program Reconstruction

The program can be represented as a set of basic blocks, where each basic block is a sequence of non-branching instructions that ends either with a branch instruction (conditional or unconditional) or a return. Thus, any program can be represented by a directed graph in which nodes represent basic blocks and edges represent the program control flow – i.e., the control flow graph (CFG). Basic blocks can be determined at compile-time. However, our main goal is to determine the context and frequency of reachability of the basic blocks – i.e., when the code is revisited through execution. Basic blocks can be grouped according to their runtime behavior and execution paths. A Trace is a set of connected basic blocks in the CFG, and it represents an execution path (See Figure 4.5a). A Trace contains one or more conditional branches that may transfer control out of the trace boundaries, namely exits. Exits transfer control back to the program or to another trace. It is possible for a trace to have more than one entry, and a trace with a single entry is named a Superblock [70]. Our basic idea is to optimistically split code into parallel traces. For each trace, we create a synthetic method (See Figure 4.1b; a sketch is given after the list below) that:

• Contains the code of the trace.

• Receives its entry point and any variables accessed by the trace code as input parameters.

• Returns the exit point of the trace (i.e., the point where the function returns).
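To make the shape of such a method concrete, the following is a hypothetical sketch for the trace hcfefg of the matrix multiplication example introduced later (Section 4.4.2); the type and function names are illustrative and do not correspond to HydraVM's generated bytecode.

// Hypothetical shape of a synthetic method packaging a trace.
// It receives the entry point and the variables used by the trace, and
// returns the exit point so that the dispatcher knows where control goes next.
struct TraceInput {
  int entry;             // which basic block of the trace to start from
  int i, j, k;           // loop indices used by the trace (illustrative)
  double **X, **A, **B;  // data touched by the trace (illustrative)
};

enum ExitPoint { EXIT_TO_H, EXIT_TO_E, EXIT_TO_I };

ExitPoint run_trace_hcfefg(TraceInput &in) {
  // ... code of basic blocks h, c, f, e, f, g, executed as a transaction ...
  return EXIT_TO_I;      // identifies which exit the control flow took
}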

Synthetic methods are executed in separate threads as memory transactions, and a TM library is used for managing the contention. We name a call to the synthetic methods with a specific set of inputs as a job. Multiple jobs can execute the same code (trace), but with different inputs. As concurrent jobs can access and modify the same memory location, it is required to protect memory against invalid accesses. To do so, we employ TM to organize access to memory and to preserve memory consistency. Each transaction is mapped to a subset of the job code or spans multiple jobs. While executing, each transaction operates on a private copy of the accessed memory. Upon a successful completion of the transaction, all modified variables are exposed to the main memory. We define a successful execution of an invoked job as an execution that satisfies the following two conditions:

• It is reachable by future executions of the program; and

• It does not cause a memory conflict with any other job having an older chronological order.

(a) Control Flow Graph with two Traces

(b) Transformed Program

Figure 4.1: Program Reconstruction as Jobs

As we will detail in Section 4.2, any execution of a parallel program produced after our transformations is made of a sequence of jobs committed after a successful execution. In a nutshell, the parallelization targets those blocks of code that are prone to be parallelized and uses the TM abstraction to mark them. Such TM-style transactions are then automatically instrumented by us to make the parallel execution correct (i.e., equivalent to the execution of the original serial application) even in the presence of data conflicts (e.g., the case of two iterations of one loop activated in parallel and modifying the same part of a shared data structure). Clearly, the presence of more conflicts leads to less parallelism and thus poor performance.

4.2 Transactional Execution

TM encapsulates optimism: a transaction maintains its read-set and write-set, i.e., the objects read and written during the execution, and at commit time checks for conflicts on shared objects. Two transactions conflict if they access the same object and one access is a write. When this happens, a contention manager [118] solves the conflict by aborting one and allowing the other to proceed to commit, yielding (the illusion of) atomicity. Aborted transactions are re-started, often immediately. Thus, a transaction ends by either committing (i.e., its operations take effect) or by aborting (i.e., its operations have no effect). The atomicity of transactions is mandatory, as it guarantees the consistency of the code even after its refactoring to run in parallel. However, if no additional care is taken, transactions run and commit independently of each other, and that could revert the chronological order of the program, which must be preserved to avoid incorrect executions. In our model, jobs (which contain transactions) run speculatively, but a transaction, in general, is allowed to commit whenever it finishes. This property is desirable to increase thread utilization and avoid fruitless stalls, but it can lead to transactions corresponding to unreachable code (e.g., a break condition that changes the execution flow), and transactions executing code with an earlier chronological order may read future values from committed transactions that correspond to code with a later chronological order. The following example illustrates this situation. Consider the example in Figure 4.2, where three jobs A, B, and C are assigned to different threads TA, TB, and TC and execute as three transactions tA, tB, and tC, respectively. Job A can have B or C as its successor, and that cannot be determined until runtime. According to the parallel execution in Figure 4.2(b), TC will finish execution before the others. However, tC will not commit until tA or tB completes successfully. This requires that every transaction notify the STM to permit its successor to commit.

Now, let tA conflict with tB because of an unexpected memory access. STM will favor the older transaction in the original execution and abort tB, and will discard its local changes.

Figure 4.2: Parallel execution pitfalls: (a) Control Flow Graph, (b) Possible parallel execution scenario, and (c) Managed TM execution.

Figure 4.3: Transaction States

Later, tB will be re-executed. A problem arises if tA and tC wrongly and unexpectedly access the same memory location. Under Figure 4.2(b)'s parallel execution scenario, this will not be detected as a transactional conflict (TC finishes before TA). To handle this scenario, we extend the lifetime of transactions to the earliest transaction starting time. When a transaction must wait for its predecessor to commit, its lifetime is extended till the end of its predecessor. Figure 4.2(c) shows the execution from our managed TM perspective. Although these scenarios are admissible under generic concurrency controls (where the order of transactions is not enforced), they clearly violate the logic of the program. To resolve this situation, program order is maintained by deferring the commit of transactions that complete early till their valid execution time. Motivated by that, we propose an ordered transactional execution model based on the original program's chronological order. Transaction execution works as follows. Transactions have five states: idle, active, completed, committed, and aborted. Initially, a transaction is idle because it is still in the transactional pool waiting to be attached to a job to dispatch. Each transaction has an age identifier that defines its chronological order in the program. A transaction becomes active when it is attached to a thread and starts its execution. When a transaction finishes the execution, it becomes completed. That means that the transaction is ready to commit, and it completed its execution without conflicting with any other transaction. A transaction in this state still holds all locks on the written addresses. Finally, the transaction is committed when it becomes reachable from its predecessor transaction. Decoupling the completed and committed states permits threads to process the next transactions without the need to wait for the transaction's valid execution time.

4.3 Jikes RVM

The Jikes Research Virtual Machine (Jikes RVM) is an open-source implementation of a Java virtual machine (JVM). Jikes RVM has a flexible and modular design that supports prototyping, testbeds, and experimental analysis. Unlike most other JVM implementations, which are usually written in native code (e.g., C, C++), Jikes is written in Java. This characteristic provides portability, object-oriented design, and integration with the running applications. Jikes RVM is divided into the following components:

• Core Runtime Services: This component is responsible for managing the execution of running applications, which includes the following:

– Thread creation, management, and scheduling.
– Loading class definitions and triggering the compiler.
– Handling calls to Java Native Interface (JNI) methods.
– Exception handling and traps.

• Magic: This is a mechanism for handling low-level system-programming operations such as raw memory access, uninterruptible code, and unboxed types. Unlike all other components, this module is not written in pure Java, as it uses machine code to provide this functionality.

• Compilers: read bytecode and generate efficient machine code executable on the current platform.

• The Memory Manager Toolkit (MMTK) handles memory allocation and garbage col- lection.

• Adaptive Optimization System (AOS): allows online feedback-directed optimizations. It is responsible for profiling an executing application and triggering the optimizing compiler to improve its performance.

4.4 System Architecture

In HydraVM, we extend the AOS [8] architecture to enable parallelization of input programs, and dynamically refine parallelized sections based on execution. Figure 4.4 shows HydraVM’s architecture, which contains six components:

• Profiler: performs static analysis and adds additional instructions to monitor data access and execution flow.

• Inspector: monitors program execution at runtime and produces profiling data.

• Optimization Compiler: recompiles bytecode at runtime to improve performance and triggers reloading of class definitions.

• Knowledge Repository: a store for profiling data and execution statistics.

• Builder: uses profiling data to reconstruct the program as multi-threaded code, and tunes execution according to data access conflicts.

• TM Manager: handles transactional concurrency control to guarantee safe memory and preserves execution order.

HydraVM works in three phases. The first phase focuses on detecting parallel patterns in the code by injecting the code with hooks, monitoring code execution, and determining memory access and execution patterns. This may lead to slower code execution due to the inspection overhead. The Profiler is active only during this phase. It analyzes the bytecode and instruments it with additional instructions. The Inspector collects information from the generated instructions and stores it in the Knowledge Repository. The second phase starts after collecting enough information in the Knowledge Repository about which blocks were executed and how they access memory. The Builder component uses this information to split the code into traces, which can be executed in parallel. The new version of the code is generated and is compiled by the Recompiler component. The TM Manager manages memory access during the execution of the parallel version, and organizes transaction commits according to the original execution order. The manager collects profiling data including the commit rate and the conflicting threads. The last phase is tuning the reconstructed program based on thread behavior (i.e., conflict rate). The Builder evaluates the previous reconstruction of traces by splitting or merging some of them, and reassigning them to threads. The last two phases work in an alternating manner till the end of program execution, as the second phase provides feedback to the third. HydraVM supports two modes: online and offline. In the online mode, we assume that program execution is long enough to capture parallel execution patterns. Otherwise, the

Figure 4.4: HydraVM Architecture

first phase can be done in a separate pre-execution phase, which can be classified as offline mode. We now describe each of HydraVM’s components.

4.4.1 Bytecode Profiling

To collect profiling information, we modify Jikes RVM's baseline compiler to insert additional instructions (in the program bytecode) at the edges of selected basic blocks (e.g., branching, conditional, and return statements) that detect whenever a basic block is reached. Additionally, we insert instructions into the bytecode to:

• Statically detect the set of variables accessed by the basic blocks, and

• Mark basic blocks with irrevocable calls (e.g., input/output operations), as they need special handling in program reconstruction.

This code modification does not affect the behavior of the original program. We call this version of the modified program the profiled bytecode.

4.4.2 Trace detection

With the profiled bytecode, we can view the program execution as a graph with basic blocks and variables represented as nodes, and the execution flow as edges. A basic block that is visited more than once during execution will be represented by a different node each time (See Figure 4.5b). The benefits of the execution graph are manifold:

• Hot-spot portions of the code can be identified by examining the graph's hot paths,

• Static data dependencies between blocks can be determined, and

• Parallel execution patterns of the program can be identified.

To determine traces, we use a string factorization technique: each basic block is represented by a character that acts like a unique ID for that block. Now, an execution of a program can be represented as a string. For example, Figure 4.5a shows a matrix multiplication code snippet. An execution of this code for a 2x2 matrix can be represented with the execution graph shown at Figure 4.5b, or as the string abjbhcfefghcfefghijbhcfefghcfefghijk. We factorize this string into its basic components using a variant of Main's algorithm [84]. The factorization converts the matrix multiplication string into ab(jb(hcfefg)2hi)2jk. Using this representation, combined with grouping blocks that access the same memory locations, we divide the code into multiple sets of basic blocks, namely traces (See Figure 4.5c). In our example, we detected three traces:

for (Integer i = 0; i < DIMx; i++)
    for (Integer j = 0; j < DIMx; j++)
        for (Integer k = 0; k < DIMy; k++)
            X[i][j] += A[i][k] * B[k][j];

(a) Matrix Multiplication Code

(b) 2x2 Matrix Multiplication Execution Graph with Traces

(c) Control Flow Graph with Traces

Figure 4.5: Matrix Multiplication

1. Trace ab, with two entries (to a and to b) and two exits (to j and to h);

2. Trace jk, with a single entry (to j) and two exits (to z and to b); and

3. Trace hcfefg, with two entries (to h and to e) and three exits (to h, to e, and to i). In this trace the innermost loop was unrolled, so each trace represents two iterations of the innermost loop. This is reflected in Figure 4.5c by adding an extra node f. Note that the transition from g to h is represented by an exit and an entry, not as an internal transition within the trace. This difference enables running multiple jobs concurrently executing the same trace code.

Thus, we divide the code, optimistically, into independent parts called traces that represent subsets of the execution graph. Each trace does not overlap with other traces in accessed variables, and represents a long sequence of instructions, including branch statements, that commonly execute in this pattern. Since a branch instruction has taken and not-taken paths, the trace may contain one or both of the two paths according to the frequency of using those paths. For example, in biased branches, one of the paths is taken most often, so it is included in the trace, leaving the other path outside the trace. On the other hand, in unbiased branches, both paths may be included in the trace. Therefore, a trace has multiple exits, according to the program control flow during its execution. A trace also has multiple entries, since a jump or a branch instruction may target one of the basic blocks that construct it. The Builder module orchestrates the construction of traces and distributes them over parallel threads. However, this may potentially lead to an out-of-order execution of the code, which we address through STM concurrency control (see Section 4.2). I/O instructions are excluded from parallel traces, as changing their execution order affects the program semantics, and they are irrevocable (i.e., they cannot be rolled back when a transaction aborts).

4.4.3 Parallel Traces

Upon detection of candidate traces for parallelization, the program is reconstructed as a producer-consumer pattern. In this pattern, two daemon threads are active, a producer and a consumer, which share a common fixed-size queue of jobs. Recall that a job represents a call to a synthetic method executing the trace code with a specific set of inputs. The producer generates jobs and adds them to the queue, while the consumer dequeues the jobs and executes them. HydraVM uses a Collector module and an Executor module to process the jobs: the Collector has access to the generated traces and uses them as jobs, while the Executor executes the jobs by assigning them to a pool of core threads. Figure 4.6 shows the overall pattern of the generated program. Under this pattern, we utilize the available cores by executing jobs in parallel. However, doing so requires handling of the following issues:

• Threads may finish out of the original execution order.

Figure 4.6: Program Reconstruction as a Producer-Consumer Pattern

• The execution flow may change at runtime, causing some of the assigned traces to be skipped in the correct execution.

• Due to differences between the execution flow observed in the profiling phase and the actual execution, memory access conflicts between concurrent accesses may occur. Also, memory arithmetic (e.g., arrays indexed with variables) may easily violate the program reconstruction (see the example in Section 4.4.5).

To tackle these problems, we execute each job as a transaction. A transaction’s changes are deferred until commit. At commit time, a transaction commits its changes if and only if: 1) it did not conflict with any other concurrent transaction, and 2) it is reachable under the execution.
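The structure of this producer-consumer execution can be sketched as follows; the Job, JobQueue, and runExecutor names mirror the Collector/Executor components described above, but the code is only an illustration under simplified assumptions, not HydraVM's implementation.

#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// A job: a call to a synthetic trace method with a specific entry point and
// inputs, plus an age used by the ordered commit described in Section 4.2.
struct Job {
  uint64_t age;
  std::function<void()> run;   // executes the trace body as a transaction
};

class JobQueue {               // fixed-size queue shared by producer and consumer
  std::deque<Job> q;
  size_t cap;
  std::mutex m;
  std::condition_variable notFull, notEmpty;
public:
  explicit JobQueue(size_t capacity) : cap(capacity) {}
  void put(Job j) {            // Collector (producer) side
    std::unique_lock<std::mutex> lk(m);
    notFull.wait(lk, [&] { return q.size() < cap; });
    q.push_back(std::move(j));
    notEmpty.notify_one();
  }
  Job take() {                 // Executor (consumer) side
    std::unique_lock<std::mutex> lk(m);
    notEmpty.wait(lk, [&] { return !q.empty(); });
    Job j = std::move(q.front());
    q.pop_front();
    notFull.notify_one();
    return j;
  }
};

// The Executor assigns dequeued jobs to a pool of worker threads; each job is
// executed as a transaction whose commit is deferred until its age is reached.
// (Workers run as daemons; termination handling is omitted for brevity.)
void runExecutor(JobQueue &queue, unsigned workers) {
  std::vector<std::thread> pool;
  for (unsigned w = 0; w < workers; w++)
    pool.emplace_back([&queue] {
      for (;;) queue.take().run();
    });
  for (auto &t : pool) t.join();
}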

4.4.4 Reconstruction Tuning

TM preserves data consistency, but it may cause degraded performance due to successive conflicts. To reduce this, the TM Manager provides feedback to the Builder component to reduce the number of conflicts. We store the commit rate and the conflicting scenarios in the Knowledge Repository to be used later for further reconstruction. When the commit rate falls below a preconfigured minimum, the Builder is invoked. Conflicting traces are combined into a single trace. This requires changes to the control instructions (e.g., branching conditions) to maintain the original execution flow. The newly reconstructed version is recompiled and loaded as a new class definition at runtime.
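A minimal sketch of this feedback check follows; the statistics structure, the threshold, and the function name are hypothetical, and only the decision rule (a commit rate below a preconfigured minimum triggers reconstruction) reflects the text above.

#include <cstdint>

// Illustrative statistics collected for a trace (or a pair of conflicting traces).
struct TraceStats {
  uint64_t commits = 0;
  uint64_t aborts = 0;
};

// Returns true when the observed commit rate drops below the preconfigured
// minimum, signaling the Builder to merge conflicting traces and reload the code.
bool needsReconstruction(const TraceStats &s, double minCommitRate = 0.5) {
  uint64_t total = s.commits + s.aborts;
  if (total == 0) return false;
  return static_cast<double>(s.commits) / static_cast<double>(total) < minCommitRate;
}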

Figure 4.7: 3x3 Matrix Multiplication Execution using Traces Generated by profiling 2x2 Matrix Multiplication

4.4.5 Misprofiling

Profiling depends mainly on the program input. This input may not reflect some runtime aspects of the program flow (e.g., loop limits, biased branches). To illustrate this, we return to the matrix multiplication example in Figure 4.5a. Based on the profiling using 2x2 matrices, we construct the execution graph shown in Figure 4.5b. Now, recall our three traces ab, hcfefg, and jk, and assume we need to run this code for 2x3 and 3x2 matrices. The Collector will assign jobs to the Executor, but upon the execution of the trace jk, the Executor will find that the code exits after j and needs to execute b. Hence, it will request the Collector to schedule the job ab, with an entry to basic block b, in the incoming job set. Doing so allows us to extend the flow to cover more iterations. Note that the entry point must be sent to the synthetic method that represents the trace, as it should be able to start from any of its basic blocks (e.g., ab will start from b, not a). In Figure 4.7, traces are represented by blocks with their entry points on the left side and exits on the right. The figure describes the execution using the traces extracted by profiling the 2x2 matrix multiplication (see the example in Section 4.4.1).

4.5 Implementation

4.5.1 Detecting Real Memory Dependencies

Recall that we use bytecode as the input, and concurrency refactoring is done entirely at the VM level. Compiler optimizations, such as register reductions and variable substitutions, increase the difficulty in detecting memory dependencies at the bytecode-level. For example,

y = 1          y1 = 1
y += 2         y2 = y1 + 2
x = y / 2      x1 = y2 / 2

Figure 4.8: Static Single Assignment Form Example

two independent basic blocks in the source code may share the same set of local variables or loop counters in the bytecode. To overcome this problem, we transform the bytecode into the Static Single Assignment (SSA) form [15]. The SSA form guarantees that each local variable has a single static point of definition and is assigned exactly once, which significantly simplifies analysis. Figure 4.8 shows an example of the SSA form. Using the SSA form, we inspect assignment statements, which reflect the memory operations required by the basic block. At the end of each basic block, we generate a call to a hydra touch operation that notifies the VM about the variables that were accessed in that basic block. In the second phase of profiling, we record the execution paths and the memory accessed by those paths. We then package each set of basic blocks in a trace. Traces should not conflict by accessing the same memory objects. However, it is possible to have such conflicts, since our analysis uses information from past executions (which could be different from the current execution). We intentionally designed the data dependency algorithm to ignore some questionable data dependencies (e.g., a loop index). This gives more opportunities for parallelization: if a questionable dependency occurs at runtime, then TM will detect and handle it. Otherwise, such blocks will run in parallel and greater speedup is achieved.
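Conceptually, the injected instrumentation reduces to a per-block notification such as the one sketched below; hydra touch is the operation named above, but its signature and the recording structures shown here are purely illustrative.

#include <cstdint>
#include <map>
#include <set>
#include <string>

// Illustrative record of what the profiler learns per basic block.
struct BlockProfile {
  uint64_t visits = 0;               // execution frequency of the block
  std::set<uintptr_t> touchedAddrs;  // memory locations accessed by the block
};

static std::map<int, BlockProfile> knowledgeRepository;  // keyed by block id
static std::string executionString;                      // e.g., "abjbhcfefg..."

// Hypothetical signature of the injected call: invoked at the end of each
// instrumented basic block with the block id and the addresses it accessed.
void hydra_touch(int blockId, const std::set<uintptr_t> &addrs) {
  BlockProfile &p = knowledgeRepository[blockId];
  p.visits++;
  p.touchedAddrs.insert(addrs.begin(), addrs.end());
  executionString.push_back(static_cast<char>('a' + (blockId % 26)));  // block as a character
}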

4.5.2 Handling Irrevocable Code

Input and output instructions must be handled as a special case in the reconstruction and parallel execution as they cannot be rolled back. Traces with I/O instructions are therefore marked for special handling. The Collector never schedules such marked traces unless they are reachable – i.e., they cannot be run in parallel with their preceding traces. However, they can be run in parallel with their successor traces. This implicitly ensures that at most one I/O trace executes (i.e., only a single job of this trace runs at a time).
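The Collector's scheduling rule for such traces can be summarized by a check of the following form; the job fields and the function name are illustrative, not HydraVM's API.

#include <cstdint>

// Illustrative job descriptor: its age is the chronological order of the trace,
// and the irrevocable flag marks traces containing I/O instructions.
struct SchedJob {
  uint64_t age;
  bool irrevocable;
};

// An irrevocable job is dispatched only once it is reachable (i.e., it is the
// next job in chronological order), so it never needs to be rolled back.
// Revocable jobs may be dispatched speculatively at any time.
bool canDispatch(const SchedJob &j, uint64_t nextReachableAge) {
  return !j.irrevocable || j.age == nextReachableAge;
}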

4.5.3 Method Inlining

Method inlining is the insertion of the complete body of a method in every place that it is called. In HydraVM, method calls appear as basic blocks, and in the execution graph, they appear as nodes. Thus, inlining occurs automatically as a side effect of the reconstruction process. This eliminates the time overhead of invoking a method. Another interesting issue is handling recursive calls. The execution graph for recursion will appear as a repeated sequence of basic blocks (e.g., abababab . . .). Similar to method inlining, we merge multiple levels of recursion into a single trace, which reduces the overhead of managing parameters over the heap. Thus, a recursive call under HydraVM will be formed as nested transactions with a lower depth than the original recursive code.

4.5.4 ByteSTM

ByteSTM is an STM that operates at the bytecode level, which yields the following benefits:

• Significant implementation flexibility in handling memory access at low-level (e.g., registers, thread stack) and for transparently manipulating bytecode instructions for transactional synchronization and recovery;

• Higher performance due to implementing all TM building blocks (e.g., versioning, conflict detection, contention management) at bytecode-level; and

• Easy integration with other modules of HydraVM (Section 4.4)

We modified the Jikes RVM to support TM by adding two instructions, xBegin and xCommit, which are used to start and end a transaction, respectively. Each load and store inside a transaction is done transactionally: loads are recorded in a read signature, and stores are sandboxed; stores are kept in a transaction-local storage, called the write-set. The address of any variable (accessible at the VM level) is added to the write signature. The read/write signatures are represented using Bloom filters [16] and used to detect read/write or write/write conflicts. This approach is more efficient than comparing the read-sets and write-sets of transactions, but it also introduces false positives. (With the correct signature size, the effect of false positives can be reduced – we do this.) When a load is called inside a transaction, we first check the write-set to determine if this location has been written to before; if so, the value from the write-set is returned. Otherwise, the value is read from memory and the address is added to the read signature. At commit time, the read signature and write signature of concurrent transactions are compared, and if there is a conflict, the newer transaction is aborted and restarted. If the validation shows no conflict, then the write-set is written to memory. For a VM-level STM, greater optimizations are possible than for non-VM-level STMs (e.g., Deuce [74], DSTM2 [65]). At the VM level, data types do not matter; only their sizes do. This allows us to simplify the data structures used to handle transactions. One- through eight-byte data types are handled in the same way. Similarly, all different data addressing is reduced to absolute addressing. Primitives, objects, array elements, and statics are handled differently inside the VM, but they are translated into an absolute address and a specific size in bytes. This simplifies and speeds up the write-back process, since we only care about writing back some bytes at a specific address. This allows us to work at the field level and at the array element level, which significantly reduces conflicts: if two transactions use the same object, but each uses a different field inside the object, then no conflict occurs (similarly for arrays). Another optimization is the ability to avoid the VM's garbage collector (GC). GC can reduce STM performance when attempting to free unused objects. Also, dynamically allocating new memory to be used by STM is costly. ByteSTM disables the GC for the memory used for the internal data structures that support STM: we statically allocate memory for STM, handle it without interruption from the GC, and manually recycle it. New memory is allocated only if there is a memory overflow. Note that if a hybrid TM is implemented in Java, then it must be implemented inside the VM. Otherwise, hybrid TM will violate invariants of internal data structures used inside the VM, leading to inconsistencies. We also inline the STM code inside the load and store instructions and the newly added instructions xBegin and xCommit. Thus, there is no overhead in calling the STM procedures in ByteSTM. Each job has an order that represents its logical order in the sequential execution of the original program. To preserve data consistency between jobs, the STM must be modified to support this ordering. Thus, in ByteSTM, when a conflict is detected between two jobs, we abort the one with the higher order.
Also, when a block with a higher order tries to commit, we force it to sleep until its order is reached. ByteSTM commits the block if no conflict is detected. When attempting to commit, each transaction checks its order against the expected order. If they are the same, the transaction proceeds, validates its read-set, commits its write-set, and updates the expected order. Otherwise, it sleeps and waits for its turn. The validation is done by scanning the thread stack and registers, and collecting the accessed objects' addresses. Object IDs are retrieved from the object copies and used to create a transaction signature, which represents the memory addresses accessed by the transaction. Transactional conflicts are detected using the intersection of transaction signatures. After committing, each thread checks if the next thread is waiting for its turn to commit, and if so, that thread is woken up. Thus, ByteSTM keeps track of the expected order and handles commits in a decentralized manner.
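The essence of this ordered commit can be sketched as follows; the structures and names below are illustrative (and written in C++ rather than at the Jikes RVM bytecode level), but they follow the scheme just described: sandboxed writes, Bloom-filter-like signatures, and a shared expected-order counter that gates commits.

#include <atomic>
#include <bitset>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

static std::atomic<uint64_t> expectedOrder{0};   // next age allowed to commit
static std::mutex commitMtx;
static std::condition_variable commitCv;

struct OrderedTx {
  uint64_t age;                                  // chronological order of the job
  std::map<uintptr_t, uint64_t> writeSet;        // address -> buffered (sandboxed) value
  std::bitset<1024> readSig, writeSig;           // Bloom-filter-like signatures

  static size_t hash(uintptr_t a) { return (a >> 3) % 1024; }

  uint64_t read(uintptr_t addr) {
    auto it = writeSet.find(addr);               // read-after-write inside the transaction
    if (it != writeSet.end()) return it->second;
    readSig.set(hash(addr));                     // record the read in the signature
    return *reinterpret_cast<uint64_t *>(addr);
  }

  void write(uintptr_t addr, uint64_t v) {
    writeSig.set(hash(addr));                    // record the write in the signature
    writeSet[addr] = v;                          // sandbox the store until commit
  }

  // A completed transaction waits until its age is reached, revalidates,
  // publishes its write-set, and wakes the next transaction in order.
  // On a failed validation it returns false and the caller aborts/retries
  // (the transaction keeps its age, so it remains the next one to commit).
  bool commit(bool (*validate)(const OrderedTx &)) {
    std::unique_lock<std::mutex> lk(commitMtx);
    commitCv.wait(lk, [&] { return expectedOrder.load() == age; });
    if (!validate(*this)) return false;
    for (auto &e : writeSet)
      *reinterpret_cast<uint64_t *>(e.first) = e.second;
    expectedOrder.fetch_add(1);                  // allow the successor to commit
    commitCv.notify_all();
    return true;
  }
};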

4.5.5 Parallelizing Nested Loops

Nested loops introduce a challenge for parallelization, as it is difficult to parallelize both the inner and outer loops, and doing so adds complexity to the system design. In HydraVM, we handle nested loops as nested transactions using the closed-nesting model [95]: aborting a parent transaction aborts all its inner transactions, but not vice versa, and changes made by inner transactions become visible to their parents when they commit, but those changes are hidden from the outside world till the highest-level parent's commit.

Figure 4.9: Nested Traces

More precisely, our nesting model has the following properties:

1. Inner transactions share the read-set/write-set of their parent transactions;

2. Changes made by inner transactions become visible to their parent transactions when the inner transactions commit, but they are hidden from the outside world till the commit of the highest level parent;

3. Aborting an inner transaction aborts only its changes (not other sibling transactions, and not its parent transaction);

4. Aborting a parent transaction aborts all its inner transactions;

5. Inner transactions may conflict with each other and also with other, non-parent, higher-level transactions; and

6. Inner transactions are mutually ordered; i.e., their ages are relative to the first inner transaction of their parent. When an inner transaction conflicts with another inner transaction of a different parent, the ages of parent transactions are compared.

The rationale for using closed nesting, instead of other models such as linear nesting [95], is twofold (a minimal sketch of the closed-nesting semantics is given after the list below):

1. Inner transactions run concurrently; conflict is resolved by aborting the higher age transaction.

2. We leverage the partial rollback of inner-transactions without affecting the parent or sibling transactions.
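A minimal sketch of these commit/abort rules, assuming a simple buffered write-set per transaction; the types and structure are illustrative, not ByteSTM's actual implementation.

#include <cstdint>
#include <map>

// Illustrative closed-nesting semantics: an inner commit publishes its
// buffered writes only to its parent; only a top-level commit reaches memory.
struct NestedTx {
  NestedTx *parent = nullptr;
  std::map<uintptr_t, uint64_t> writeSet;   // address -> buffered value

  void write(uintptr_t addr, uint64_t v) { writeSet[addr] = v; }

  void commit() {
    if (parent) {
      // Inner changes become visible to the parent only (property 2).
      for (auto &e : writeSet) parent->writeSet[e.first] = e.second;
    } else {
      // Top-level commit: changes become visible to the outside world.
      for (auto &e : writeSet) *reinterpret_cast<uint64_t *>(e.first) = e.second;
    }
  }

  // Aborting an inner transaction discards only its own buffer (property 3);
  // aborting a parent discards everything its children merged into it (property 4).
  void abort() { writeSet.clear(); }
};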

Although the open nesting model [94] allows further increases in concurrency, open nesting contradicts our ordering model. In open nesting, inner transactions are permitted to commit

Configuration    Value
Processor        AMD Opteron Processor
CPU Cores        8
Clock Speed      800 MHz
L1 Cache         64 KB
L2 Cache         512 KB
L3 Cache         5 MB
Memory           12 GB
OS               Ubuntu 10.04, Linux

Table 4.1: Testbed and platform parameters for HydraVM experiments.

before its parent transaction, and if the parent transaction is aborted, it uses a compensating action to revert the changes done by its committed inner transactions. However, in our model, allowing inner transactions to commit would violate the ordering rule (lower-order transactions commit first), and preventing them from committing (i.e., waiting until their chronological order is reached) would defeat the purpose of open nesting. Consider our earlier matrix multiplication example (see Section 4.4). From the execution string, ab(jb(hcfefg)2hi)2jk, we can create two nested traces: an outer trace jb(hcfefg)2hi, and an inner trace hcfefg (see Figure 4.9). The outer trace runs within a transaction, executing jbhi, that invokes a set of inner transactions hcfefg after the execution of the basic block b.

4.6 Experimental Evaluation

Benchmarks. To evaluate HydraVM, we used five applications as benchmarks. These include a matrix multiplication application and four applications from the JOlden benchmark suite [23]: minimum spanning tree (MST), tree add (TreeAdd), traveling salesman (TSP), and bitonic sort (BiSort). The applications are written as sequential applications, though they exhibit data-level parallelism. Testbed. We conducted our experiments on an 8-core multicore machine. Each core is an 800 MHz AMD Opteron processor, with 64 KB L1 data cache, 512 KB L2 data cache, and 5 MB L3 data cache. The machine ran Ubuntu Linux. Evaluation. Table 4.2 shows the result of the Profiler analysis on the benchmarks. The table shows the number of basic blocks, traces, and the average number of instructions per basic block. The lower part of the table shows the number of jobs executed by the Executor, and the maximum level of nesting during the experiments. Using our techniques, we manage to split the sequential implementation of the benchmarks into parallel jobs that exploit the data-level parallelism.

Table 4.2: Profiler Analysis on Benchmarks

Benchmark            Matrix   TSP    BiSort   MST     TreeAdd
Avg. Instr. per BB   4.29     4.2    4.75     3.7     4.1
Basic Blocks         31       77     24       52      10
Traces               3        12     5        3       4
Jobs                 1001     1365   1023     12241   8195
Max Nesting          2        5      2        1       3

Figure 4.10: HydraVM Speedup (speedup over sequential execution with 2, 4, 6, and 8 processors for Matrix, TSP, BiSort, MST, and TreeAdd)

Figure 4.10 shows the speedup obtained for different numbers of processors. For matrix multiplication, HydraVM reconstructs the two outer loops into nested transactions, while the innermost loop forms a single trace because of its iteration dependencies. In TSP, BiSort, and TreeAdd, multiple levels of the recursive calls are inlined into a single trace. For the MST benchmark, each iteration over the graph adds a new node to the MST, which creates inter-dependencies between iterations. However, updating the costs between the constructed MST and the other nodes presents a good parallelization opportunity for HydraVM.

4.7 Discussion

We presented HydraVM, a JVM that automatically refactors concurrency in Java programs at the bytecode level. HydraVM extracts control-level parallelism by reconstructing the program as independent traces. Loops, as a special case of traces, are included in the reconstruction procedure, which supports data-level parallelism. Although our proposal targets extracting code traces, most applications spend most of their time executing loops. Loops have interesting features such as symmetric code blocks, data-level parallelism, recurrences, and induction variables. In the next chapter, we focus on loops as a unit for parallelization. Online profiling and adaptive compilation provide transparent execution of the programs. Nevertheless, they add an overhead to the runtime. Such overhead is non-negligible for short-lived applications, or for ones with unpredictable execution paths. Static analysis of the code could be beneficial for providing an initial guess of the hot-spot regions of the code, and for building dependency relations between code blocks. Adding a pre-execution static analysis and employing data dependency analysis could enhance the code generation and reduce the transactional overhead; we follow this approach in Lerna.

Chapter 5

Lerna

In this chapter, we present Lerna, a system that automatically and transparently detects and extracts parallelism from sequential code. Lerna is cross-platform and independent of the programming language, and it does not require analysis of the application's source code: it simply takes the intermediate representation compiled using LLVM [80], the well-known compiler infrastructure, as input and produces ready-to-run parallel code as output1, thus finding its best fit with (legacy) sequential applications. This approach also makes Lerna independent of the specific hardware used. Similar to HydraVM, the parallel execution exploits memory transactions to manage concurrent and out-of-order memory accesses. While HydraVM presents an initial concept of a virtual machine that exploits in-memory transactions to parallelize traces, Lerna focuses on parallelizing loops, which usually contain the hotspot sections of many applications. Recall that HydraVM reconstructs the code at runtime through recompilation and reloading of class definitions, and it is obligated to run the application through the virtual machine. Unlike HydraVM, Lerna does not change the code at runtime through recompilation, which enables us to compile the code to its native representation. Nevertheless, Lerna supports adaptive execution of transformed programs by changing some key performance parameters of its runtime library (see Section 5.7). Finally, the extensive profiling phase in HydraVM relies on establishing a relation between basic blocks and their accessed memory addresses, which limits its usage to small-sized applications. In Lerna, we replaced that with a static alias analysis phase combined with a lightweight offline profiling step that complements our static analysis. Lerna is a complete system, which overcomes all the above limitations and embeds system and algorithmic innovations to provide high performance. Our experimental study involves the transformation of 10 sequential applications into parallel versions. Results show an average speedup of 2.7× for micro-benchmarks and 2.5× for the macro-benchmarks.

1Lerna supports the generation of a native executable as output.


5.1 Challenges

Despite the high-level goal described above, without fine-grain optimizations and innovations, deploying TM-style transactions to blocks of code that are prone to be parallelized leads the application performance to be slower (often much slower) than the sequential, non-instrumented execution. As an example of that, a blind parallelization of a loop (Figure 5.1a) would mean wrapping the whole body of the loop within a transaction (Figure 5.1b). By doing so, we let all transactions conflict with each other on, at least, the increment of the variable c or i2. In addition: variables that are never modified within the loop may be transactionally accessed; the transaction commit order should be the same as the completion order of the iterations if they had executed sequentially; and aborts could be costly, as they involve retrying the whole transaction, including local processing work. The combination of these factors nullifies any possible gain due to parallelization, thus letting the application pay just the overhead of the transactional instrumentation and, as a consequence, providing performance slower than sequential execution. Lerna does not suffer from the above issues (as shown in Figure 5.1c). It instruments a small subset of code instructions, which is enough to preserve correctness, and optimizes the processing by a mix of static optimizations and dynamic tuning. The former includes: loop simplification, induction variable reduction, removal of non-transactional work from the context to restore after a conflict, exploitation of the symmetry of the executed parallel code, and an optimized in-order commit of transactions. Regarding the latter, Lerna provides an adaptive runtime layer to improve the performance of the parallel execution. This layer is fine-tuned by collecting feedback from the actual execution in order to capture the best settings of the key performance parameters that most influence the effectiveness of the parallelization (e.g., number of worker threads, size and number of parallel jobs). The results are impressive. We evaluated Lerna's performance using a set of 10 applications including micro-benchmarks from the RSTM [2] framework, STAMP [25], a suite of sequential applications designed for evaluating in-memory concurrency controls, and Fluidanimate [96], an application performing physics simulations. We selected them because (except for Fluidanimate) they also provide a performance upper bound for Lerna. In fact, they are released with a version that provides synchronization by using manually defined, and optimized, transactions. This way, besides the speedup over the sequential implementation, we can show the performance of Lerna against the same application with an efficient, hand-crafted solution. Lerna is on average 2.7× faster than the sequential version using micro-benchmarks (with a peak of 3.9×), and 2.5× faster considering macro-benchmarks (with a top speedup of one order of magnitude reached with STAMP). Lerna is the first self-contained, completely automated and transparent system that makes sequential applications parallel. Thanks to an efficient use of transactional blocks, it finds its sweet spot with applications involving data sharing, and not simply those with partitioned memory access.

2Libraries that require programmer interaction, such as OpenMP, already offer programming primitives to handle loop increments.

c = min;
while (i < max) {
    i++;
    c = c + 5;
    ... local processing work ...
    if (i < j)
        k = k + c;
}

(a) Loop with data dependency

c = min;
while (i < max) {
    atomic {
        TX_WRITE(i, TX_READ(i) + 1);
        TX_WRITE(c, TX_READ(c) + 5);
        ... local processing work ...
        if (TX_READ(i) < TX_READ(j))
            TX_WRITE(k, TX_READ(k) + TX_READ(c));
    }
}

(b) Loop with atomic body

c = min;
while (i < max) {
    i++;
    parallel(i) {
        c = min + i * 5;
        ... local processing work ...
        atomic {
            if (i < j)
                TX_INCREMENT(k, c);
        }
    }
}

(c) Loop with parallelized body

Figure 5.1: Lerna's Loop Transformation: from Sequential to Parallel.

Figure 5.2: LLVM Three Layers Design (language front-ends such as Clang, llvm-gcc, GHC, VMKit, and CUDA; a common optimization layer over the LLVM intermediate representation; and back-ends for targets such as x86, ARM, AMD, and PowerPC)

5.2 Low-Level Virtual Machine

Low-Level Virtual Machine (LLVM) is a modular and reusable collection of libraries, with well-defined interfaces, that defines a complete compiler infrastructure. It represents a middle layer between front-end language-specific compilers and back-end instruction-set generators. LLVM offers an intermediate representation (IR), a bytecode, that can be optimized and transformed into more efficient IR. Optimizers can be chained to provide multiple levels of optimization at different scopes (module, call graph, function, loop, region, and basic block). LLVM supports a language-independent instruction set and data types in the Static Single Assignment (SSA) form. Such separation of layers helps developers write front-ends that make use of the underlying compiler optimizations. At the current stage, LLVM supports most of the widely used languages, such as C, C++, Objective-C, Java, Fortran, Haskell, and Ruby – the full list is at [3]. The most common LLVM front-end is Clang, which supports compiling C, Objective-C, and C++ code. Clang is supported by Apple as a replacement for the C/Objective-C compiler in the GCC system. Likewise, several platforms, including x86/x86-64, AMD, ARM, SPARC, MIPS,

and Nvidia, have LLVM back-end generators for their instruction sets – see Figure 5.2. These factors together build a large community around LLVM as a popular compiler infrastructure. With LLVM, researchers build their optimizations on the intermediate layer using LLVM bytecode, and have their optimizations available for a large set of languages and platforms.
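As a concrete illustration, a trivial analysis pass over the IR can be written and registered as follows; this sketch assumes the legacy pass-manager C++ API that was current around LLVM 3.x, and the pass name is illustrative.

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

namespace {
// A minimal function pass: it only reports how many basic blocks each
// function has, and does not modify the IR.
struct CountBlocks : public FunctionPass {
  static char ID;
  CountBlocks() : FunctionPass(ID) {}
  bool runOnFunction(Function &F) override {
    errs() << F.getName() << ": " << F.size() << " basic blocks\n";
    return false;   // IR not modified
  }
};
}
char CountBlocks::ID = 0;
static RegisterPass<CountBlocks> X("countblocks", "Count basic blocks", false, false);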

5.3 General Architecture and Workflow

Lerna is deployed as a container of:

• an automated software tool that performs a set of transformations and analysis steps (also called passes in accordance with the terminology used by LLVM) that run on the LLVM intermediate representation of the original binary application code. It produces a refactored multi-threaded version of the input program that can run efficiently on multiprocessor architectures;

• a runtime library that is linked dynamically to the generated program, and is responsible for: 1) organizing the transactional execution of dispatched jobs so that the original program order (i.e., the chronological order) is preserved; 2) selecting (adaptively) the most effective number of worker threads according to the actual setup of the runtime environment, and based on the feedback collected from the online execution; 3) scheduling jobs to threads according to the threads' characteristics (e.g., stack size, priority); and 4) performing memory housekeeping and releasing computational resources.

Figure 5.3 shows the architecture and the workflow of Lerna. Lerna operates at the LLVM intermediate representation of the input program, thus it does not require the application to be written in any specific programming language. However, Lerna's design does not preclude the programmer from providing hints that can be leveraged to make the refactoring process more effective. In this work we present the fully automated process without considering any programmer intervention, and we discuss how to exploit prior application knowledge in Section 5.7. Lerna's workflow includes the following three steps, in this order: Code Profiling, Static Analysis, and Runtime. In the first step, our software tool executes the original (sequential) application by activating our own profiler, which collects some important parameters (e.g., execution frequencies) later used by the Static Analysis. The goal of the Static Analysis is to produce a multi-threaded (also called reconstructed) version of the input program. This process evolves through the following passes:

Figure 5.3: Lerna's Architecture and Workflow. The Code Profiling stage takes the input program (bytecode), optional user annotations, and training input data, and produces the profiled code and profiling information; the Static Analysis stage runs the Dictionary, Builder, and Transactifier passes and emits the reconstructed (multi-threaded) program; the Lerna Runtime comprises the executor, the workers manager with its jobs queue and thread pool, the STM with contention manager and garbage collector, and the knowledge base fed by the commit and abort rates.

• Dictionary Pass. It scans the input program to provide a list of the accessible functions of the bytecode (or the bitcode, as it is named by LLVM), i.e., those that are neither system calls nor native-library calls, that we can analyze to determine how to transform them. By default, any call to an external function is flagged as unsafe. This information is important because transactions cannot contain unsafe calls, as these may include irrevocable (i.e., not abortable) operations such as I/O system calls.

• Builder Pass. It detects the code eligible for parallelization; it transforms this code into a callable synthetic method; and it defines the transaction’s boundaries (i.e., where the transaction begins and ends).

• Transactifier Pass. It applies alias analysis [35] (i.e., it detects whether multiple references point to the same memory location) and some memory dependency techniques (e.g., given a memory operation, it extracts the preceding memory operations on which it depends) to reduce the number of transactional reads and writes. It also instruments the memory operations invoked within the body of a transaction by wrapping them into transactional calls for read, write, or allocation.

Once the Static Analysis is complete, the reconstructed version of the program is linked against the Lerna runtime library, which is mainly composed of the following three components:

• Executor. It dispatches the parallel jobs and returns the exit of the last job to the program. To exploit parallelism, the executor dispatches multiple jobs at a time by grouping them into a batch. Once a batch is complete, the executor simply waits for the result of this batch. Not all the jobs are enclosed in a single batch, thus the executor may need to dispatch more jobs after the completion of the previous batch. If no more jobs need to be dispatched, the executor finalizes the execution of the parallel section.

• Workers Manager. It extracts jobs from a batch and delivers ready-to-run transactions to available worker threads.

• TM. It provides the handlers for transactional accesses (read and write) performed by executing jobs. In case a conflict is detected, it also behaves as a contention manager by aborting the conflicting transactions with the higher chronological order (this way the original program’s order is respected). Also, it handles the garbage collection of the memory allocated by a transaction, after it completes.

The runtime library makes use of two additional components: the jobs queue, which stores the (batches of) dispatched jobs until they are executed; and the knowledge base, which maintains the feedback collected from the execution in order to enable the adaptive behavior.

5.4 Code Profiling

Lerna uses code profiling to identify the hotspot sections of the original code, namely those most visited during the execution. This information is fundamental for letting the refactoring process focus on the parts of the code that are fruitful to parallelize (e.g., it would not be effective to parallelize a for-loop with only two iterations). To do that, we consider the program as a set of basic blocks, where each basic block is a sequence of non-branching instructions that ends either with a branch instruction (conditional or unconditional) or a return. Given that, any program can be represented as a graph in which nodes are basic blocks and edges reproduce the program control flow (an example of such a graph is shown in Figure 5.5). Basic blocks are easily determined from the bytecode (see Figure 5.4). In this phase, our goal is to identify the context, frequency and reachability of each basic block. To determine this information, we profile the input program by instrumenting its bytecode at the boundaries of each basic block, so that we detect whenever a basic block is reached. This code modification does not affect the behavior of the original program. We call this version of the modified program the profiled bytecode.
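For illustration, the sketch below shows what the profiled bytecode conceptually looks like, at the source level, for the loop of Figure 5.1a; lerna_profile_bb is a hypothetical counter hook, since the actual instrumentation is inserted at the LLVM intermediate representation level rather than in the source code.

    // Profiled version of the loop in Figure 5.1a (illustrative sketch).
    // lerna_profile_bb() is a hypothetical per-basic-block counter increment.
    enum Block { ENTRY, WHILE_COND, WHILE_BODY, IF_THEN, IF_END, WHILE_END };
    extern void lerna_profile_bb(Block b);   // assumed profiling hook
    extern void do_local_processing();

    void profiled_loop(int *i, int *c, int j, int *k, int min, int max) {
      *c = min;
      lerna_profile_bb(ENTRY);
      while (lerna_profile_bb(WHILE_COND), *i < max) {
        lerna_profile_bb(WHILE_BODY);
        (*i)++;
        *c = *c + 5;
        do_local_processing();
        if (*i < j) {
          lerna_profile_bb(IF_THEN);
          *k = *k + *c;
        }
        lerna_profile_bb(IF_END);
      }
      lerna_profile_bb(WHILE_END);
    }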

5.5 Program Reconstruction

In the following, we illustrate in detail the transformation from sequential to parallel code made during the static analysis phase. The LLVM intermediate representation (i.e., the bytecode) is in static single assignment (SSA) form. With SSA, each variable is defined before it is used and is assigned exactly once. Figure 5.4 shows the LLVM intermediate representation of the loop in Figure 5.1a. The code in Figure 5.4 is in SSA form and is divided into a set of basic blocks. Each basic block starts with an optional label, is composed of LLVM assembly instructions, and ends with a branching instruction (terminator).

5.5.1 Dictionary Pass

In the dictionary pass, a full bytecode scan is performed to determine the list of accessible code (i.e., the dictionary) and, as a consequence, the external calls. Any call to an external function that is not included in the input program prevents the enclosing basic block from being included in the parallel code. However, the user can override this rule by providing a list of safe external calls. An external call is defined as safe if:

• It is revocable (e.g., it does not perform input/output operations);

• It does not affect the state of the program; and

    entry:
      %retval = alloca i32, align 4
      br label %while.cond
    while.cond:                              ; preds = %if.end, %entry
      %0 = load i32* %i, align 4
      %1 = load i32* %max, align 4
      %cmp = icmp slt i32 %0, %1
      br i1 %cmp, label %while.body, label %while.end
    while.body:                              ; preds = %while.cond
      %2 = load i32* %i, align 4
      %inc = add nsw i32 %2, 1
      store i32 %inc, i32* %i, align 4
      call void @_Z19do_local_processingv()
      %3 = load i32* %i, align 4
      %4 = load i32* %j, align 4
      %cmp1 = icmp slt i32 %3, %4
      br i1 %cmp1, label %if.then, label %if.end
    if.then:                                 ; preds = %while.body
      %5 = load i32* %k, align 4
      %add = add nsw i32 %5, 1
      store i32 %add, i32* %k, align 4
      br label %if.end
    if.end:                                  ; preds = %if.then, %while.body
      br label %while.cond
    while.end:                               ; preds = %while.cond
      %6 = load i32* %retval
      ret i32 %6

Figure 5.4: The LLVM Intermediate Representation using SSA form of Figure 5.1a.

• It is thread safe.

Common examples of safe calls are random number generators and basic mathematical functions such as trigonometric functions.

5.5.2 Builder Pass

This pass is one of the core steps of the refactoring process because it takes the code to transform (the output of the profiling phase) and makes it parallel by matching it against the outcome of the dictionary pass. In fact, if the profiler highlights a frequently invoked basic block that contains calls not in the dictionary, then the parallelization cannot be performed on that basic block. In this thesis we focus on loops as the most appropriate blocks of code for being parallelized. However, our design is applicable (unless stated otherwise) to any independent set of basic blocks. The actual operation of building the parallel code takes place after the following two transformations.

• Loop Simplification analysis. A natural loop has one entry block (the header) and one or more back edges (latches) leading to the header. The predecessor blocks of the loop header are called pre-header blocks. We say that a basic block α dominates another basic block β if every path in the code to β goes through α. The body of the loop is the set of basic blocks that are dominated by its header and reachable from its latches. The exits are basic blocks that jump to a basic block that is not included in the loop body. In Figure 5.4, the entry block is the loop pre-header, while its header is while.cond. The loop has one latch (i.e., if.end), and a single exit while.end (from while.cond). The loop body is the set of blocks while.body, if.then and if.end. A simple loop is a natural loop with a single pre-header and a single latch, and its index (if it exists) starts from zero and increments by one. We apply the loop simplification to put the loop into its simplest form. Examples of natural and simple loops are reported in Figures 5.5 (a) and (b), respectively. In Figure 5.5 (a), the loop header has two types of predecessors: external basic blocks from outside of the loop, and one of the body latches. Putting this loop in its simple form requires adding: i) a single pre-header, changing the external predecessors to jump to it; and ii) an intermediate basic block to isolate the second latch from the header.

• Induction Variable analysis. An induction variable is a variable within a loop whose value changes by a fixed amount every iteration (i.e., the loop index) or is a linear function of another induction variable. Affine (linear) memory accesses are commonly used in loops (e.g., array accesses, recurrences). The index of the loop, if one exists, is often an induction variable, and the loop can contain more than one induction variable. The

Figure 5.5: Natural, Simple and Transformed Loop. (a) Natural loop with header, latches and exits; (b) simple loop with a single pre-header and a single latch; (c) transformed loop, where the pre-header issues asynchronous calls to the synthetic method and the Dispatcher and Sync blocks route the normal exit and the break exits.

induction variable substitution is a transformation that rewrites any induction variable in the loop as a closed form (function) of its index. It starts by detecting the candidate induction variables, then it sorts them topologically and creates a closed symbolic form for each of them. Finally, it substitutes their occurrences with the corresponding symbolic form (a small example follows this list).
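As a small illustrative sketch (with hypothetical variable names), consider a loop with a second induction variable updated by a constant stride; the substitution rewrites it as a closed form of the canonical index:

    extern void use(int);
    int s, s0, n;

    // Before substitution: s is an induction variable coupled to the index,
    // so every iteration depends on the previous one.
    void before() {
      for (int idx = 0; idx < n; idx++) {
        s = s + 3;
        use(s);
      }
    }

    // After substitution: each iteration computes s from the canonical index
    // (s0 is the value of s before the loop), removing the dependency.
    void after() {
      for (int idx = 0; idx < n; idx++) {
        int s_i = s0 + 3 * (idx + 1);
        use(s_i);
      }
      s = s0 + 3 * n;   // final value restored once, after the loop
    }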

As a part of our transformation, a loop is simplified, and its induction variable (i.e., the index) is transformed into its canonical form, where it starts from zero and is incremented by one. A simple loop with multiple induction variables is a very good candidate for parallelization. However, induction variables introduce dependencies between iterations, which are not desirable when maximizing parallelism. To solve this problem, the value of each such induction variable is calculated as a function of the loop index prior to executing the loop body, and it is passed to the synthetic method as a runtime parameter. This approach avoids unnecessary conflicts on the induction variables. Next, we extract the body of the loop as a synthetic method. The return value of the method is a numeric value representing the exit that should be taken. The addresses of all variables accessed within the loop body are passed as parameters to the method. The loop body is replaced by two basic blocks: Dispatcher and Sync. In the Dispatcher, we prepare the arguments for the synthetic method, calculate the value of the loop index and invoke an API of our library, named lerna_dispatch, providing it with the address of the synthetic method and the list of the just-computed arguments.

(a) Symmetric Transactions:   Tx1: { Y[1] := 0; Z[1] := X[1]; }   Tx2: { Y[2] := 0; Z[2] := X[2]; }

(b) Normal Transactions:      Tx1: { Y[1] := 0; Z[1] := X[1]; }   Tx2: { X[1] := Y[1]; }

Figure 5.6: Symmetric vs Normal Transactions

Each call to lerna_dispatch adds a job to our internal jobs queue, but it does not start the actual execution of the job. The Dispatcher keeps dispatching jobs until our API decides to stop. When that happens, the control passes to the Sync block. Sync immediately blocks the main thread and waits for the completion of the current jobs. Figure 5.5 (c) shows the control flow graph (CFG) of the loop before and after the transformation. Regarding the exit of a job, we define two types of exits: normal exits and breaks. A normal exit occurs when a job reaches the loop latch at the end of its execution. In this case, the execution should go back to the header and the next job should be dispatched. If there are no more dispatched jobs to execute and the last one returned a normal exit, then the Dispatcher will invoke more jobs. On the other hand, when the job exit is a break, the execution needs to leave the loop body, and hence all later jobs are ignored. For example, assume a loop with N iterations. If the Dispatcher invokes B jobs before moving to the Sync, then ⌈N/B⌉ is the maximum number of transitions that can happen between Dispatcher and Sync. Summarizing, the Builder Pass turns the execution into a job-driven model, which can exploit parallelism. This strategy abstracts the processing from the way the code is written.
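To make the Dispatcher/Sync structure concrete, the following sketch shows how a transformed loop could drive the runtime; the signatures of lerna_dispatch and of the other runtime calls, as well as the argument packaging, are assumptions made for illustration only.

    // Sketch of a transformed loop driving the runtime (assumed signatures).
    struct JobArgs { long index; int min; int *j; int *k; };

    extern void lerna_dispatch(int (*job)(void *), void *args); // enqueue one job (assumed)
    extern int  lerna_sync();   // run the queued jobs; return the exit of the last one (assumed)
    extern bool lerna_more();   // ask the runtime whether to keep batching (assumed)
    extern int  synthetic_body(void *args);  // extracted loop body: 0 = normal exit, 1 = break

    void transformed_loop(int min, int max, int *j, int *k) {
      long i = 0;
      int exit_code = 0;
      while (i < max && exit_code == 0) {
        // Dispatcher block: batch jobs without executing them.
        do {
          JobArgs *a = new JobArgs{ i, min, j, k };  // freed by the runtime after the job (assumed)
          lerna_dispatch(synthetic_body, a);
          i++;
        } while (i < max && lerna_more());
        // Sync block: block the main thread until the whole batch completes.
        exit_code = lerna_sync();
      }
    }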

5.5.3 Transactifier Pass

After turning the bytecode into executable jobs, we employ additional passes to encapsulate jobs into transactions. Each synthetic method is demarcated by tx_begin and tx_end, and any memory operation (i.e., loads, stores, or allocations) within the synthetic method is replaced by the corresponding transactional handler. It is quite common that memory reads are numerous (and outnumber writes), thus it is highly beneficial to minimize those performed transactionally. That is because the read-set maintenance and the commit-time validation, which iterates over the read-set to preserve the correctness of the transaction, are the primary sources of TM's overhead. Several hardware prototypes have been proposed to enhance TM read-set management [26, 116]. Alternatively, in our work the transactifier pass eliminates unnecessary transactional reads, thus significantly improving the performance of the transaction execution due to the

following reasons:

• a direct memory read is as much as three times faster than a transactional read [26, 116]. In fact, reading an address transactionally requires: 1) checking if the address has already been written before (i.e., checking the write-set); 2) adding the address to the read-set; and 3) returning the address value to the caller.

• the size of the read-set is limited, thus extending it requires copying entries into a larger read-set, which is costly. Keeping the read-set small reduces the number of resize operations.

• read-set validation is mandatory during the commit. The smaller the read-set, the faster the commit operation.

In our model, concurrent transactions can be described as "symmetric", meaning that the code executed by all active transactions is the same; each transaction executes one or more iterations of the same loop. Figure 5.1 shows an example of symmetric transactions. We take advantage of this characteristic by reducing the number of transactional calls as follows. Clearly, local addresses defined within the scope of the loop do not need to be accessed transactionally. On the other hand, global addresses allow iterations to share information, and thus they need to be accessed transactionally. We perform a global alias analysis as a part of our transactifier pass to exclude some of the loads of shared addresses from the instrumentation process. To reduce the number of transactional reads, we apply the global alias analysis between all loads and stores in the transaction body. A load operation that can never alias with any store operation does not need to be read transactionally. For example, when a memory address is always loaded and never written in any path of the symmetric transaction code, as in Figure 5.6a, the load does not need to be performed transactionally. In Figure 5.1, concurrent transactions execute symmetric iterations of the loop. Although j is read within the transactions, we do not have to read it transactionally because it can never be changed by any of the transactions produced from the same loop (the only ones allowed to run concurrently). Note that this technique is specific to parallelizing loops and cannot be applied to normal transaction processing, where transactions do not necessarily execute the same code as in the symmetric case. In contrast, Figure 5.6b shows an example of two concurrent non-symmetric transactions, i.e., transactions executing different bodies. In this case each transaction also has a load operation that does not alias with the stores of its own body, but when the two transactions (with different code paths) run concurrently they can produce a wrong result. This is because these transactions are not symmetric, and thus the read must be done transactionally.

Transactions may contain calls to other functions. As these functions may manipulate memory locations, they must be handled. Whenever possible, we try to inline the called functions; otherwise we create a transactional version of the function to be called within a transaction. In the latter case, instead of calling the original function, we call its transactional version. Inlined functions are preferable because they permit the detection of dependencies between variables, which can be leveraged to reduce transactional calls, or the detection of dependent loop iterations, which is useful to exclude them from the parallelization. Finally, to avoid unnecessary overhead in the presence of single-threaded computation, or when a single job is executed at a time, we create another, non-transactional version of the synthetic method. This way we provide a fast version of the code without unnecessary transactional accesses.
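As an illustration of the transactifier output, the sketch below shows how the synthetic method generated from the loop of Figure 5.1 could look after instrumentation; the primitive names follow the notation of this chapter, while their exact signatures are assumptions.

    // Sketch of the instrumented synthetic method for the loop of Figure 5.1.
    extern void tx_begin();
    extern void tx_end();
    extern int  tx_read(int *addr);            // assumed signature
    extern void tx_write(int *addr, int value); // assumed signature
    extern void do_local_processing();

    int synthetic_body(long i, int min, int *j, int *k) {
      int local_c = min + (int)i * 5;  // induction variable rewritten in closed form: no tx access
      do_local_processing();           // local work kept outside the transaction scope
      tx_begin();
      int jv = *j;                     // j never aliases a store in the symmetric body: plain load
      if ((int)i < jv)
        tx_write(k, tx_read(k) + local_c);  // k is shared across iterations: transactional access
      tx_end();
      return 0;                        // normal exit
    }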

5.6 Transactional Execution

Transactional memory algorithms differ in the memory versioning technique, i.e., undo-log or write-buffer, and in the conflict detection, i.e., lazy or eager. In write-buffer algorithms, transactional loads are recorded in a read-set and stores are sandboxed, meaning that each store is not written to the original address but is kept in a local storage, called the write-set, until commit. When a load is issued inside a transaction, the write-set is first checked to determine whether this location has been written before and, if so, the value from the write-set is returned. Otherwise, the value is read from memory and the address is added to the read-set. At commit time, the read-set is validated to make sure that it is still consistent. If the validation shows no conflict, then the write-set is written back to the shared memory and the changes become visible to all. On the other hand, undo-log algorithms expose memory changes to the main memory right after the transactional write, and they keep the old values in a local log to be restored upon a transaction abort. Contention between transactions occurs when two transactions access the same address and at least one of them is a writer. Eager contention detection is done by associating a lock with each memory address (a lock can cover multiple addresses). A transaction acquires locks on its written addresses at encounter time. Conflicts are detected when a transaction tries to access a locked address. Alternatively, in lazy contention detection, each address has a version record. A transaction stores the versions of the addresses it reads. Successful transactions update the versions of their written addresses at commit time. A transaction is aborted when it finds a version number different from the one it recorded. Validation of memory addresses accessed transactionally is done either by comparing addresses [43] or by checking whether the values have changed [38]. Table 5.1 shows different design choices of some known transactional memory implementations. Our proposal is decoupled from the specific TM implementation used, but it requires in-order commit.
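As a concrete reference for the write-buffer scheme described above, the following simplified sketch outlines lazy-versioning read and write handlers; it does not reproduce the code of any specific algorithm listed in Table 5.1.

    #include <unordered_map>
    #include <vector>

    // Simplified write-buffer (lazy versioning) transaction descriptor.
    struct Tx {
      std::unordered_map<long*, long> write_set;   // address -> buffered value
      std::vector<long*>              read_set;    // addresses to validate at commit
    };

    long tx_read(Tx &tx, long *addr) {
      auto it = tx.write_set.find(addr);
      if (it != tx.write_set.end())     // read-after-write: return the buffered value
        return it->second;
      tx.read_set.push_back(addr);      // record the address for commit-time validation
      return *addr;
    }

    void tx_write(Tx &tx, long *addr, long value) {
      tx.write_set[addr] = value;       // memory is untouched until commit
    }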

                                Version
                     Eager                          Lazy
Contention   Eager   TinySTM [50], LogTM [92],      LTM [6], TinySTM [50],
                     UTM [6], McRT-STM [115]        SwissTM [45]
             Lazy                                   TL2 [43], TCC [57], NOrec [38]

Table 5.1: Transactional Memory Design Choices

To allow the integration of further TMs, we identified the following requirements needed to support ordering:

Single Committer. At any time only one thread is allowed to commit its transaction(s). This thread is the one executing the transaction with the lowest age. While a thread is committing, other threads can proceed by speculatively executing the next transactions, or wait until the commit completes. Allowing threads to proceed with their execution is risky because it can increase the contention probability (the lifetime of an uncommitted transaction is extended), so this speculation must be limited by a certain, predefined threshold.

Age-based Contention Management (CM). Algorithms with eager conflict detection (i.e., at encounter time) should favor transactions with lower age (i.e., those that encapsulate older iterations), while algorithms that use lazy conflict detection (i.e., at commit time) should employ an aggressive CM that favors the transaction that is committing through the single committer. Note that, with value-based validation and eager versioning, it is possible for an earlier transaction to wrongly commit after it has read from a speculative iteration. To solve this, during the commit phase, the read-set must be compared with the write-sets of completed transactions.

Memory Versioning. With eager versioning, if the implementation uses an eager CM, then speculative iterations are prevented from affecting older iterations because they will collide. On the other hand, a lazy CM may cause earlier iterations to read from speculative iterations. However, thanks to the single committer, the earlier iteration will detect the conflict and both transactions will be aborted. Lazy versioning implementations hide their changes from other transactions, thus no modification is required for this category.
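A minimal sketch of the single-committer rule follows, assuming that each worker knows the age of the transaction it is about to commit and that a shared counter tracks the next age allowed to commit (names are illustrative):

    #include <atomic>

    // next_commit_age holds the age of the only transaction allowed to commit.
    std::atomic<unsigned long> next_commit_age{0};

    void commit_in_order(unsigned long my_age, void (*do_commit)()) {
      while (next_commit_age.load(std::memory_order_acquire) != my_age)
        ;                        // spin, or speculatively start the next transaction
      do_commit();               // validate and write back; the CM favors this transaction
      next_commit_age.fetch_add(1, std::memory_order_release);  // hand the token to the successor
    }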

5.6.1 Ordered NOrec

In the current implementation we used NOrec [38] as the TM library. It is an algorithm that offers low memory access overhead with a constant amount of global metadata. Unlike most STM algorithms, NOrec does not associate ownership records (e.g., locks or version numbers) with the accessed addresses; instead, it employs a value-based validation technique during commit. A characteristic of this algorithm is that it permits a single committing writer at a time, which is in general not desirable but matches the need of Lerna's concurrency control: having a single committer (see Section 5.6). For this reason we decided to rely on

NOrec as the default STM implementation for Lerna, because of its small memory footprint and because it matches our ordering conditions (the single-committer limitation of NOrec is, in fact, one of the requirements for ordering transactions). Our modified version of NOrec selects which transaction should be the single committer according to the chronological order (i.e., the age).

5.6.2 Ordered TinySTM

The TinySTM algorithm uses encounter-time locking (ETL) and comes with two memory access strategies: write-through and write-back. TinySTM uses timestamps to give transactions a consistent view of memory; we exploit this timestamp to represent the age of a transaction. Using an aggressive age-based contention manager, transactions can be ordered according to the chronological order of their executed code. With TinySTM, transactions conflict at access time; this allows early detection of conflicting iterations. The choice of the write-through or write-back strategy is workload specific: at low contention, write-through provides a fast-path execution, while write-back is more suitable for high contention as it exhibits a lower-overhead abort procedure.

5.6.3 Irrevocable Transactions

A transaction performs a read-set validation at commit time to preserve correctness, i.e., to ensure that its read-set has not been overwritten by any other committed transaction. Let Tx_n be a transaction that has just started its execution, and let Tx_{n-1} be its immediate predecessor (i.e., Tx_{n-1} and Tx_n process consecutive iterations of a loop). If Tx_{n-1} committed before Tx_n performed its first transactional read, then we can avoid the read-set validation of Tx_n when it commits. That is because Tx_n is now the highest priority transaction, so no other transaction can commit its changes to memory. We do that by flagging Tx_n as an irrevocable transaction. Similarly, a transaction is irrevocable if: i) it is the first, thus it has no predecessor; ii) it is a retried transaction of the single committer thread; or iii) it belongs to a sequence of transactions with consecutive ages running on the same thread. This optimization reduces the commit time to just writing the write-set values to memory.
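The conditions above can be summarized by the following sketch; the descriptor fields are hypothetical and serve only to illustrate the checks.

    // Sketch of the irrevocability test (hypothetical descriptor fields).
    struct TxDesc {
      unsigned long age;              // chronological order (iteration index)
      bool first_read_done;           // has a transactional read been issued yet?
      bool is_retry_of_committer;     // retried by the single-committer thread
      bool prev_on_same_thread;       // predecessor (consecutive age) runs on this thread
    };

    bool can_be_irrevocable(const TxDesc &tx, unsigned long last_committed_age) {
      if (tx.age == 0) return true;                 // first transaction: no predecessor
      if (tx.is_retry_of_committer) return true;    // retry of the single committer
      if (tx.prev_on_same_thread) return true;      // consecutive ages on the same thread
      if (!tx.first_read_done && last_committed_age == tx.age - 1)
        return true;                                // predecessor committed before our first read
      return false;
    }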

5.6.4 Transactional Increment

Figure 5.7 illustrates a common situation: the counter. Loops with counters hamper parallelism because they create data dependencies between iterations, even non-consecutive ones, which produces a large amount of conflicts. The induction variable substitution cannot produce a closed form (function) of the loop index (if one exists) for such counters.

(a) Conditional increments:

    for (int i = 0; i < 100; i++) {
      ...
      if (some_condition)
        counter++;
      ...
    }

(b) No induction variable:

    while (proceed) {
      ...
      counter++;
      ...
    }

Figure 5.7: Conditional Counters

If a variable is incremented (or decremented) based on an arbitrary condition and its value is used only after the loop completes its whole execution, then it is eligible for the Transactional Increment optimization. In addition to the classical transactional read and write (tx_read and tx_write), we propose a new transactional primitive, the transactional increment, to enable the parallelization of loops with irreducible counters. This type of counter can be detected during our transformations. Within the transactional code, a store St is eligible for our optimization if it aliases with only one load Ld, and it writes a value that is based on the return value of Ld. The load, change, and store operations are replaced with a call to tx_increment, which receives the address and the value to increment. We propose two ways to implement tx_increment:

• Using an atomic increment on the variable, and storing the address in the transaction's metadata (a sketch of this strategy follows the list). The atomic operation preserves data consistency; however, it affects the shared memory before the transaction commits. To address this issue, aborted transactions compensate all accessed counters by performing the same increment but with the inverse value.

• By storing the increments into thread-specific meta-data. At the end of each Sync operation, threads coordinate with each other to expose the aggregated per-thread increments of the counter. This method is appropriate for floating point variables, which cannot be updated atomically on commodity hardware.

Using this approach, transactions will not conflict on this address, and the correct value of the counter will be in memory after the completion of the loop (See Figure 5.1c).
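The first of the two strategies above can be sketched as follows; the abort handler and the metadata layout are simplified assumptions.

    #include <atomic>
    #include <vector>

    // Apply the increment atomically and log it so that an aborted transaction
    // can compensate with the inverse value.
    struct IncrementLog { std::atomic<long> *addr; long delta; };
    thread_local std::vector<IncrementLog> tx_increments;

    void tx_increment(std::atomic<long> *counter, long delta) {
      counter->fetch_add(delta, std::memory_order_relaxed);   // visible before commit
      tx_increments.push_back({counter, delta});              // remembered for compensation
    }

    void tx_on_abort_compensate() {                            // called by the abort path (assumed)
      for (const IncrementLog &e : tx_increments)
        e.addr->fetch_add(-e.delta, std::memory_order_relaxed);
      tx_increments.clear();
    }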

5.7 Adaptive Runtime

The Adaptive Optimization System (AOS) [8] is a general virtual machine architecture that allows online feedback-directed optimizations. In Lerna, we apply the AOS to optimize the runtime environment by tuning some important parameters (e.g., the batch size, the number of worker threads) and by dynamically refining sections of code already parallelized statically, according to the characteristics of the actual application execution. Before presenting the optimizations made at runtime, we detail the component responsible for executing jobs (i.e., the Workers Manager in Figure 5.8). Jobs are evenly distributed over workers. Each worker thread keeps a local queue of its slice of dispatched jobs and a circular buffer of transaction descriptors. A worker is in charge of executing transactions and keeping them in the completed state once they finish. As stated before, after the completion of a transaction, the worker can speculatively begin the next transaction. However, to avoid unmanaged behaviors, the number of speculative jobs is limited by the size of its circular buffer. The buffer size is crucial as it controls the lifetime of transactions. A larger buffer allows the worker to execute more transactions, but it also increases the transaction lifetime and, consequently, the conflict probability. The ordering is managed by a worker-local flag called the state flag. This flag is read by the current worker, but is modified by its predecessor worker. Initially, only the first worker (executing the first job) has its state flag set, while the others have their flag cleared. After completing the execution of each job, the worker checks its local state flag to determine whether it is permitted to commit or to proceed to the next transaction. If there are no more jobs to execute, or the transactions buffer is full, the worker spins on its state flag. Upon a successful commit, the worker resets its flag and notifies its successor to commit its completed transactions. Finally, if one of the jobs has a break condition (i.e., not the normal exit), the workers manager stops the other workers by setting their flags to a special value. This approach maximizes the use of cache locality, as threads operate on their own transactions and access thread-local data structures, which also reduces bus contention.
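The ordering protocol can be sketched as follows; the flag values, the descriptor-buffer handling, and the notification mechanism are simplified placeholders.

    #include <atomic>

    enum WorkerState { WAIT = 0, MAY_COMMIT = 1, STOP = 2 };

    struct Worker {
      std::atomic<int> state{WAIT};   // written by the predecessor worker
      Worker *successor = nullptr;
    };

    void worker_step(Worker &self, void (*execute_job)(), void (*commit_completed)()) {
      execute_job();                                        // run the transaction speculatively
      while (self.state.load(std::memory_order_acquire) == WAIT)
        ;                                                   // spin until the predecessor commits
      if (self.state.load(std::memory_order_relaxed) == STOP)
        return;                                             // a break exit cancelled later jobs
      commit_completed();                                   // commit the completed transactions
      self.state.store(WAIT, std::memory_order_relaxed);    // reset own flag
      self.successor->state.store(MAY_COMMIT, std::memory_order_release);  // notify the successor
    }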

5.7.1 Batch Size

The static analysis does not always provide information about the number of iterations, hence we cannot accurately determine the best size for batching jobs. A large batch size may cause many aborts due to unreachable jobs, while small batches increase the number of round trips between the Dispatcher and the executor and, as a consequence, the number of pauses due to Sync. In our implementation, we use an exponentially increasing batch size. Initially, we dispatch a single job, which covers the common case of loops with zero iterations; if the loop is longer, then we increase the number of dispatched jobs exponentially until reaching a pre-configured threshold. Once a loop is entirely executed, we record the last batch size used so that, if the execution goes back to calling the same loop, we do not need to repeat the initial tuning.
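A sketch of the exponential batch sizing follows; the threshold and the knowledge-base lookup are placeholders.

    // Returns how many jobs to dispatch in the next batch for a given loop.
    unsigned next_batch_size(unsigned previous, unsigned max_batch, long remembered) {
      if (remembered > 0) return (unsigned)remembered;   // loop seen before: reuse the last size
      if (previous == 0) return 1;                       // first batch: covers zero-iteration loops
      unsigned next = previous * 2;                      // grow exponentially
      return next > max_batch ? max_batch : next;
    }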

Figure 5.8: Workers Manager. Each worker thread owns a local queue of dispatched jobs, a buffer of transaction descriptors, and a state flag written by its predecessor.

5.7.2 Jobs Tiling and Partitioning

As explained in Section 5.5.2, the transformed program dispatches iterations as jobs, and our runtime runs jobs as transactions. Here we discuss an optimization, named jobs tiling, that allows the association of multiple jobs with a single transaction. Increasing the number of jobs per transaction reduces the total number of commit operations. It also assigns enough computation to each thread to outweigh the cost of the transactional setup. Nevertheless, tiling is a double-edged sword: increasing the tile size increases the size of the read and write sets, which can degrade performance. We tune tiling by taking into account the number of instructions per job and the commit rate of past executions stored in the knowledge base. In contrast to tiling, a job may perform a considerable amount of non-transactional work. In this case, enclosing the whole job within the transaction boundaries makes the abort operation very costly. Instead, the transactifier pass checks the basic blocks with transactional operations and finds the nearest common dominator basic block for all of them. Given that, the transaction start (tx_begin) is moved to the common dominator block, and tx_end is placed at each exit basic block that is dominated by the common dominator. As a result, the job is partitioned into non-transactional work, which is now moved out of the transaction scope, and the transaction itself (See Figure 5.1c).
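For example, tiling can be sketched as wrapping several consecutive jobs in one transaction, as shown below; the tile size is chosen adaptively by the runtime and the helper names are illustrative.

    // Execute `tile` consecutive jobs inside a single transaction (sketch).
    extern void tx_begin();
    extern void tx_end();

    void run_tiled(long first_index, int tile, void (*job_body)(long)) {
      tx_begin();
      for (int t = 0; t < tile; t++)
        job_body(first_index + t);   // fewer commits, but larger read/write sets
      tx_end();
    }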

5.7.3 Workers Selection

Figure 5.3 shows how the workers manager module handles the concurrent executions. The number of worker threads in the pool is not fixed during the execution, and it can be changed

by the executor module. The number of workers directly affects the transactional conflict probability: the smaller the number of concurrent workers, the lower the conflict probability. However, optimistically, increasing the number of workers can increase the overall parallelism (thus performance) and the utilization of the underlying hardware. In practice, at the end of the execution of a batch of jobs, we calculate the throughput and record it in the knowledge base, along with the commit rate, the tile size and the number of workers involved. We apply a greedy strategy to find an effective number of workers by matching it against the obtained throughput. The algorithm constructs a window of different worker counts, and iteratively improves it by changing the count of workers. Using the throughput metric is better than relying on the commit rate. For example, three workers with an average commit rate equal to 50% are better than two workers with a 70% average commit rate. In addition, parallel execution is subject to other factors such as bus contention, cache hits, and thread overhead. The throughput combines all of these factors, thus giving a global picture of the performance gain. Finally, in some situations (e.g., high contention or very small transactions) it is better to use a single worker (sequentially). For that reason, if our heuristic decides to use only one worker, then we use the non-transactional version (as a fast path) of the synthetic method to avoid the unnecessary transaction overhead.
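A sketch of the throughput-driven greedy adjustment of the worker count follows; the window handling is simplified to a single step and the names are illustrative.

    // Pick the next worker count to try, based on measured throughput (jobs/sec).
    int next_worker_count(int current, double throughput_now, double throughput_prev, int max_workers) {
      static int direction = +1;           // current search direction
      if (throughput_now < throughput_prev)
        direction = -direction;            // the last change hurt: reverse direction
      int next = current + direction;
      if (next < 1) next = 1;
      if (next > max_workers) next = max_workers;
      return next;
    }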

5.7.4 Manual Tuning

As stated earlier, Lerna is completely automated, but it still allows the programmer to provide hints about the program to parallelize. In this section we present some of the manual configurations that can be applied. These configurations are only applicable if the source code is available. A safe call is a call to an external function defined in a library or a system call; such a call must be stateless and cannot affect the program result if repeated multiple times. Such calls cannot be detected using static analysis, so we rely on the user for defining a list of them. Our framework generates a histogram of calls that reports the number of basic blocks excluded from our transformation because of each call. Based on this histogram, the user can decide which calls are most beneficial to classify as safe. The alias analysis techniques (see Section 5.5.3) help in detecting dependencies between loads and stores; however, in some situations (as documented in [35]) they produce conservative decisions, which limit the opportunities for parallelization. It is non-trivial for the static analysis to detect aliases through nested calls. To assist the alias analysis, we try to inline the called functions within the transactional context. Nevertheless, it is common in many programs to find a function that only loads immutable variables (e.g., reading memory input). Marking such a function as read-only can significantly reduce the number of transactional reads, as we will be able to use the non-transactional version of the function, hence reducing the overall overhead.

Processor      2× Opteron 6168 processors
CPU Cores      12
Clock Speed    1.9 GHz
L1             128 KB
L2             512 KB
L3             12 MB
Memory         12 GB
OS             Ubuntu 10.04, Linux

Table 5.2: Testbed and platform parameters for Lerna experiments.

Section 5.4 explains how Lerna detects the code eligible for parallelization through the profiling phase. Alternatively, users can directly inform our Builder Pass of their recommendations for applying our analysis. The programmer can also exclude some sections of the code from being parallelized. We support a user-defined exclude list for portions of code that will be excluded from any transformation.

5.8 Evaluation

In this section we evaluate Lerna and measure the effect of the key performance parameters (e.g., job size, worker count, tiling) on the overall performance. Our evaluation involves a total of 13 applications grouped into micro-benchmarks and macro-benchmarks: STAMP [25] and PARSEC [96]. The micro-benchmarks allow us to tune the application workload in order to show the strengths (and weaknesses) of our automated solution. The applications of the macro-benchmarks show the impact of Lerna on well-known workloads. We compare the speedup of Lerna over the (original) sequential code and against the manual, optimized transactional version of the code (not available for the PARSEC benchmarks). Note that the latter is not coded by us; it is released along with the benchmark itself and is written manually with knowledge of the application logic, thus it can leverage optimizations, such as out-of-order commit, that cannot be caught by Lerna automatically. As a result, Lerna's performance goal is twofold: providing a substantial speedup over the sequential code, and being as close as possible to the manual transactional version. The testbed used in the evaluation consists of an AMD multicore machine equipped with 2 Opteron 6168 processors, each with 12 cores running at 1.9 GHz. The total memory available is 12 GB and the cache sizes are 128 KB for the L1, 512 KB for the L2 and 12 MB for the L3. This machine represents current commodity hardware well. On this machine, the overall refactoring process, from profiling to the generation of the binary, took ∼10s for the micro-benchmarks and ∼40s for the macro-benchmarks.

Figure 5.9: ReadNWrite1 Benchmark. (a) Read100Write1 - Speedup; (b) Read1000Write1 - Speedup; (c) Lerna Speedup for varying thread counts and numbers of reads.

5.8.1 Micro-benchmarks

In our first set of experiments we consider the RSTM micro-benchmarks [2] to evaluate the effect of different workload characteristics, such as the amount of transactional operations per job, the job length, and the read/write ratio, on the overall performance. We report the speedup over the sequential code while varying the number of threads used. The performance is measured for two versions of Lerna: one adaptive, where the most effective number of workers is selected at runtime (thus its performance does not depend on the number of threads reported on the x-axis), and one with a fixed number of workers. We also report the percentage of aborted transactions (right y-axis). As a general comment, we observe a small slowdown only when one thread is used; otherwise Lerna is usually very close to the manual transactional version. Our adaptive version gains on average 2.7× over the

original code, and it is effective because it finds (or is close to) the configuration where the top performance is reached.

In ReadNWrite1Bench (Figure 5.9), transactions read N locations and write 1 location. Given that, the transaction write-set is very small, which implies a fast commit for a lazy TM algorithm such as ours. The abort rate is low (0% in the experiments), and the transaction length is proportional to the read-set size. Figure 5.9c illustrates how the size of the transaction read-set (with a small write-set) affects the speedup. Lerna performs close to the manual Tx version; however, when transactions become smaller, the ordering overhead slightly outweighs the benefit of more parallel threads.

In ReadWriteN (Figure 5.10), each transaction reads N locations and then writes to another N locations. The large transaction write-set introduces a delay at commit time for lazy versioning TM algorithms and increases the number of aborts. Both Lerna and the manual Tx incur performance degradation at high numbers of threads due to the high abort rate (up to 50%). In addition, for Lerna the commit phase of long transactions forces some (ready to commit) workers to wait for their predecessor, thus degrading the overall performance. In such scenarios, the adaptive worker selection helps Lerna avoid this degradation.

MCASBench performs a multi-word compare-and-swap by reading and then writing N consecutive locations. Similarly to ReadWriteN, the write-set is large, but the abort probability is lower than before because each pair of read and write acts on the same location. Figure 5.11 illustrates the impact of increasing the number of workers with long and short transactions. In Figure 5.11e, we see that the speedup degrades as the number of operations per transaction increases; besides thread contention, the number of aborts increases with the number of accessed locations (see Figures 5.11b and 5.11d). Interestingly, unlike the manual Tx, Lerna performs better at a single thread because it uses the fast-path (non-transactional) version of the jobs to avoid needless overhead. It is worth noting that the micro-benchmarks do not perform many calculations on the accessed memory locations, which represents a challenge for a TM-based approach.

Figure 5.12 summarizes the behavior of the adaptive selection of the number of workers for the three micro-benchmarks while varying the size of the batch. The procedure starts by trying different worker counts within a fixed window (here, 7), then it picks the best count according to the calculated throughput. Changing the worker count shifts the window, thus allowing the technique to learn more and find the most effective settings for the current execution.

Figure 5.10: ReadWriteN Benchmark. (a) ReadWrite100 - Speedup; (b) ReadWrite100 - Aborts; (c) ReadWrite1000 - Speedup; (d) ReadWrite1000 - Aborts; (e) Lerna Speedup for varying thread counts and numbers of reads/writes.

Figure 5.11: MCAS Benchmark. (a) MCAS100 - Speedup; (b) MCAS100 - Aborts; (c) MCAS300 - Speedup; (d) MCAS300 - Aborts; (e) Lerna Speedup for varying thread counts and MCAS operations.

Figure 5.12: Adaptive workers selection (number of worker threads across batches for ReadNWrite1, ReadWriteN, and MCAS).

5.8.2 The STAMP Benchmark

STAMP [25] is a comprehensive benchmark suite with eight applications that cover a variety of domains. Figures 5.13, 5.14 and 5.15 show the speedup of Lerna's transformed code over the sequential code, and against the manually transactified version of the application, which exploits unordered commit. While the main focus of our work is the speedup over the baseline sequential code, we highlight here the overheads and tradeoffs with respect to a handcrafted transformation aware of the underlying program semantics. Two applications (Yada and Bayes) have been excluded because they expose non-deterministic behaviors, thus their evolution is unpredictable when executed transactionally. Table 5.3 provides a summary of the benchmarks and inputs used in the evaluation.

Kmeans, a clustering algorithm, iterates over a set of points and associates them with clusters. The main computation is finding the nearest point, while shared data updates occur at the end of each iteration. Using job partitioning, Lerna achieves 6× and 1.6× speedup over the sequential version (see Figures 5.13a-5.13d). The ordering introduces a 25% delay compared to the unordered transactional version.

Genome, a gene sequencing program, reconstructs the gene sequence from segments of a larger gene. It uses a shared hash-table to organize the segments, which requires synchronization over its accesses. In Figures 5.13e-5.13h, Lerna shows a 3× to 5.5× speedup over the sequential code. Ordering semantics is not required for the hash-table insertion, which lets the manual Tx perform 1.4× to 1.8× faster than Lerna, as its transactions commit as soon as they end.

Figure 5.13: Kmeans and Genome Benchmarks. (a) Kmeans, Low Contention - Speedup; (b) Kmeans, Low Contention - Aborts; (c) Kmeans, High Contention - Speedup; (d) Kmeans, High Contention - Aborts; (e) Genome, Low Contention - Speedup; (f) Genome, Low Contention - Aborts; (g) Genome, High Contention - Speedup; (h) Genome, High Contention - Aborts.

Benchmark    Configurations                                   Description
Kmeans       Low:  -m60 -n60 -t0.00001 -n65536 -d128 -c16     m: max clusters, n: min clusters
             High: -m20 -n20 -t0.00001 -n65536 -d128 -c16
Genome       Low:  -g16384 -s64 -n86777216                    n: number of segments
             High: -g16384 -s64 -n16777216
Vacation     Low:  -n30 -q90 -u100 -r1048576 -t4194304        n: queries, q: relations queried ratio
             High: -n50 -q60 -u100 -r1048576 -t4194304
SSCA2        Low:  -s20 -i0.1 -u0.1 -l3 -p3                   s: problem scale, i: inter cliques ratio,
             High: -s19 -i0.5 -u0.5 -l3 -p3                   u: unidirectional ratio
Labyrinth    Low:  -x128 -y128 -z3 -n128                      x, y, z: maze dimensions, n: exits
             High: -x32 -y32 -z3 -n96
Intruder     Low:  -a10 -l128 -n262144 -s1                    l: max number of packets per stream
             High: -a10 -l12 -n262144 -s1

Table 5.3: Input configurations for STAMP benchmarks.

Vacation is a travel reservation system using an in-memory database. The workload consists of client reservations; this application emulates an OLTP workload. Lerna performs up to 2.8× faster than the sequential version and is very close to the manual one (see Figures 5.14a-5.14d). Vacation transactions do not exhibit many aborts, as each accesses a relatively large amount of data.

SSCA2 is a multi-graph kernel that is commonly used in domains such as biology and security. The core of the kernel uses a shared graph structure that is updated at each iteration. The transformed kernel outperforms the original by 1.4×, while dropping the in-order commit allows up to 3.9× (see Figures 5.14e-5.14h).

Lerna exhibits no speedup with Labyrinth and Intruder (see Figure 5.15) because they use an internal shared queue to store the elements to process, and they access it at the beginning of each iteration to dispatch those elements for execution (i.e., a single contention point). While our jobs execute as a single transaction, the manual transactional version creates multiple transactions per iteration: the first transaction handles just the queue synchronization, while the others do the processing. Job partitioning does not help in this situation because the shared access occurs at the beginning of the iteration. If each job were split into two transactions, the ordering between jobs would still prevent the first transaction of later iterations from committing. This technique could therefore parallelize at most two concurrent iterations: the first transaction of the higher-index iteration can access the shared queue right after the corresponding lower-index iteration commits.

Figure 5.14: Vacation and SSCA2 Benchmarks. (a) Vacation, Low Contention - Speedup; (b) Vacation, Low Contention - Aborts; (c) Vacation, High Contention - Speedup; (d) Vacation, High Contention - Aborts; (e) SSCA2, Low Contention - Speedup; (f) SSCA2, Low Contention - Aborts; (g) SSCA2, High Contention - Speedup; (h) SSCA2, High Contention - Aborts.

Figure 5.15: Labyrinth and Intruder Benchmarks. (a) Labyrinth, Low Contention - Speedup; (b) Labyrinth, Low Contention - Aborts; (c) Labyrinth, High Contention - Speedup; (d) Labyrinth, High Contention - Aborts; (e) Intruder, Low Contention - Speedup; (f) Intruder, Low Contention - Aborts; (g) Intruder, High Contention - Speedup; (h) Intruder, High Contention - Aborts.

Figure 5.16: Effect of Tiling on abort and speedup using 8 workers and Genome.

As explained in Section 5.7.2, selecting the number of jobs per transaction (jobs tiling) is crucial for performance. Figure 5.16 shows the speedup and the abort rate while changing the number of jobs per transaction from 1 to 10000 using the Genome benchmark. Although the abort rate decreases when reducing the number of jobs per transaction, this does not achieve the best speedup: the overhead of setting up transactions nullifies the gain of executing small jobs. For this reason, we dynamically set the job tiling according to the job size and the measured throughput.

Manual tuning further assists Lerna in improving the code analysis and eliminating avoidable overhead. Evidence of this is reported in Figure 5.17, where we show the speedup of Kmeans against different numbers of worker threads using two variants of the transformed code: the first is the normal automatic transformation, and the second leverages the user's hints about memory locations that can be accessed safely (i.e., non-transactionally). The figure shows that Lerna's transformed code then outperforms even the manual transactional code.

Figure 5.17: Kmeans performance with user intervention. (a) Low Contention; (b) High Contention.

Benchmark       Configurations             Description
Fluidanimate    -i in 500K                 i: input file with 500K particles data
Swaptions       -ns 100000 -sm 1 -nt 1     ns: number of swaptions
Blackscholes    -i in 10M                  i: input file with 10 million equations data
Ferret          top=50 depth=5             top: top K, depth: depth

Table 5.4: Input configurations for PARSEC benchmarks.

5.8.3 The PARSEC Benchmark

PARSEC [14] is a benchmark suite for shared-memory chip-multiprocessor architectures. We evaluate Lerna's performance using a subset of these benchmarks covering different aspects of our implementation. Table 5.4 provides a summary of the benchmarks and inputs used in the evaluation. The Black-Scholes equation [73] is a differential equation that describes how, under a certain set of assumptions, the value of an option changes as the price of the underlying asset changes. The Blackscholes benchmark evaluates the Black-Scholes equation for the input values and produces the results. The iterations are relatively short, which causes Lerna's transformed program to produce a large number of jobs. However, jobs can be tiled (see Section 5.7.2) so that each group of iterations executes within a single job. Another approach is to add an unrolling pass before our transformation. Figure 5.18c shows the speedup with different numbers of loop unrollings. The Swaptions benchmark contains routines to compute various security prices using the Heath-Jarrow-Morton (HJM) [62] framework. Swaptions employs a Monte Carlo (MC) simulation to compute prices. Figure 5.18b shows Lerna's speedup over the sequential code. Fluidanimate [96] is a well-known application performing physics simulations (of incompressible fluids) to animate arbitrary fluid motion using a particle-based approach. The main computation is spent on computing particle densities and forces, which involves six levels of loop nesting updating a shared array structure. However, the iterations update a global shared matrix of particles, which makes every concurrent transaction conflict with its preceding transactions (see Figure 5.18d). Ferret is a toolkit used for content-based similarity search. The benchmark workload is a set of queries for image similarity search. Similar to Labyrinth and Intruder, Ferret uses a shared queue to process its queries, which represents a single contention point and prevents any speedup with Lerna (see Figure 5.18e).

Figure 5.18: PARSEC benchmarks: speedup versus number of threads for Sequential, Lerna, and Adaptive Lerna on (a) Blackscholes, (b) Swaptions, (d) Fluidanimate, and (e) Ferret, and (c) the effect of loop unrolling (no unrolling up to 8 unrollings) on Blackscholes speedup.

5.9 Discussion

Lerna can be applied to applications beyond the benchmarks used here. In this section we discuss the lessons learned from our evaluation in order to give an intuition about the (negative) cases where Lerna's parallelization refactoring is less effective.

Lerna extracts parallelism when possible. There are scenarios where, without the programmer feeding the application's logic into the refactoring process, Lerna encounters obstacles (e.g., a single point of contention) that it cannot break automatically due to the lack of "semantic" knowledge. Examples include complex data structure operations, as in Labyrinth, Intruder and Ferret explained before, or data-level conflicts as in Fluidanimate. We also looked into SPEC [63] applications, and we found that most of them use data structure iterators.

The primary sources of overhead are: ordering transactions; contention on accessing shared data (e.g., the constraint implied by the underlying bus-based architecture); and aborting conflicting transactions because of true data or control dependencies, or false conflicts. In addition, Lerna becomes less effective when: there are loops with few iterations (e.g., Fluidanimate), because the actual parallelization degree of the application is limited; there is an irreducible global access at the beginning of each loop iteration, which increases the chance of invalidating most transactions from the very beginning (e.g., Labyrinth); or the workload is heavily unbalanced across iterations. In all the above cases, at worst, the code produced by Lerna performs as the original.

Chapter 6

Conclusions & Post-Preliminary Work

The vast majority of applications and algorithms are not designed for the current trend of multi-processor chip design, which creates a gap between the available commodity hardware and the running software. This gap is expected to persist for years, together with the burden of developing and maintaining parallel programs. This dissertation proposal aims at extracting coarse-grained parallelism from sequential code. We exploited TM as an optimistic concurrency technique for supporting safe memory access and introduced algorithmic modifications for preserving the program's chronological order. TM is known for its execution overhead, which can be comparable to locking overhead and, when compared against sequential code, can outweigh any performance gain. We tackled this issue in several ways: employing static analysis, fast-path sequential execution, transactional pooling, transactional increments, and irrevocable transaction optimizations.

We presented HydraVM, a JVM that automatically refactors concurrency in Java programs at the bytecode level. Our basic idea is to reconstruct the code in a way that exhibits data-level and execution-flow parallelism. STM is exploited as a memory guard that preserves consistency and program order. Our experiments show that HydraVM achieves speedups between 2x and 5x on a set of benchmark applications.

We presented Lerna, the first completely automated system that combines a software tool and a runtime library to extract parallelism from sequential applications without any programmer intervention. Lerna leverages software transactions to resolve conflicts due to data sharing in the produced parallel code, thus preserving the original application semantics. With Lerna, it is finally possible to ignore the application logic and exploit cheap hardware parallelism through blind parallelization. Lerna showed promising results on multiple benchmarks, with an average 2.7x speedup over the original code.


6.1 Proposed Post-Prelim Work

In the previous chapters, we leveraged the key idea of employing TM as an optimistic concurrency strategy in which code blocks run as transactions and commit according to the program's chronological order. A conflict is handled by aborting (and re-executing) the transaction that runs the code with the later chronological order.

As post-preliminary work, we propose two optimizations for enhancing TM performance and increasing processor utilization. Our first optimization aims to relax the transaction commit ordering and instead modify the underlying TM algorithm to recover from inconsistent memory states. Second, we aim to minimize the costly abort procedure by checkpointing the work done within a conflicting transaction; that way, an aborted transaction does not need to redo all its work from the beginning.

6.1.1 Analysing Ordered Commit

Assume Ti and Tj run code blocks i and j respectively, where i < j in the program chronological order (age). Although Tj may finish executing its code before Ti, it must wait for Ti to commit before it can start committing its own changes. This strategy forces a thread executing a higher-age transaction to either: i) stall, or ii) freeze the completed transaction and move on to higher-age transactions, which increases the transaction lifetime (and hence the conflict probability). In both cases, the overall utilization is negatively affected, either by wasting processing cycles in the stall or by re-executing aborted transactions (as a result of the increased conflict probability).

Let τx be the time required for Tx to execute code block x, where x is the transaction age according to the program chronological order. Tx requires a time χx to expose (commit) its changes to the outside world (including the time to retry the transaction if the commit fails) and to notify its successor transaction of the commit completion.

Assume Tx is the last committed transaction and that transactions Tx+1 to Tx+n are running in parallel on n processors. Given two transactions Tx+i and Tx+j, where i < j, if Tx+j completes its execution before Tx+i finishes its commit, then Tx+j has to wait for a time δi,j. The following formula gives the total wait time for N transactions distributed equally over the n processors.

Σδ = (N − n) ∗ ((n − 1) ∗ χ − τ) + χ ∗ n ∗ (n − 1)/2

The first n transactions start at the same time but commit in order, causing the delay represented by the second term of the equation. Transactions n+1 through N have to wait for all their predecessors to complete their commits; however, there can be an overlap between a transaction's active execution and the commits of its predecessors. This (possible) delay is captured by the first term of the formula.

Figure 6.1: The Execution of Ordered Transactions over 4 and 8 threads (τ: execution time, χ: commit time, δ: ordering delay).

Figure 6.1 shows the execution of ordered transactions running over 4 and 8 threads, where χ = 0.25 τ. For the sake of brevity, in this example we assume that: 1) all transactions execute code blocks of the same size, and 2) there are no conflicting transactions. With 4 threads, the delay occurs only at the beginning (the second term of the above equation), while with 8 threads the execution suffers from a continuous delay. Using the above formula, we can see that up to 5 threads there is no continuous delay during the execution. The total delay grows quadratically with the number of threads, and it is affected by the ratio of the commit time, χ, to the execution time, τ. Figure 6.2 shows the total delay for an increasing number of threads and for different ratios of χ to τ. Notice that χ can be longer than τ when the transaction is aborted during its commit (e.g., as a result of read-set validation) and must be re-executed.

Figure 6.2: Total delay (Σδ) of executing 1000 transactions with an increasing number of threads, for χ ∈ {0.25τ, 0.5τ, τ, 2τ}; the y axis is in units of τ.
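To make the delay model concrete, the following small program, which is not part of Lerna, evaluates the formula for Σδ for several thread counts and χ/τ ratios; clamping the steady-state term at zero is our own assumption, consistent with the observation that small thread counts exhibit no continuous delay.

#include <algorithm>
#include <cstdio>
#include <initializer_list>

// Sketch only: evaluate the ordered-commit delay model from the formula above.
// All times are expressed in units of tau; the per-transaction steady-state
// wait (n-1)*chi - tau is clamped at zero (our assumption).
double totalDelay(int N, int n, double chi) {
    const double tau = 1.0;
    double warmup = chi * n * (n - 1) / 2.0;                       // first n transactions commit in order
    double steady = (N - n) * std::max(0.0, (n - 1) * chi - tau);  // remaining N-n transactions
    return warmup + steady;
}

int main() {
    const int N = 1000;                           // transactions, as in Figure 6.2
    for (double chi : {0.25, 0.5, 1.0, 2.0})
        for (int n = 2; n <= 20; n += 2)
            std::printf("chi=%.2f tau  threads=%2d  total delay=%9.1f tau\n",
                        chi, n, totalDelay(N, n, chi));
    return 0;
}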

From the previous discussion, we see that ordering transaction commits greatly affects the overall utilization and can nullify any gain targeted by adding more threads. In the following section, we propose some changes to transactional memory algorithms to avoid the ordering delay.

Towards Out-Of-Order Transactional Execution

In the context of speculative code execution using memory transactions, a transaction is aborted if: 1) it is unreachable (e.g., due to a change in the program control flow), or 2) it conflicts with an earlier transaction. If we allow transactions to commit independently of their chronological order, then we need to handle the following situations:

• For unreachable transactions, there must be a way to roll back their changes.
• An aborted transaction must be able to cascade the abort to any other transaction that has read a variable in the aborted transaction's write-set.
• Lower-age transactions must be able to conflict with (and abort) committed higher-age transactions.

To support these situations, we need to keep some transaction meta-data even after the transaction commits. This meta-data helps us detect conflicts with, or abort, the committed transaction; however, once we are sure that the transaction is reachable (i.e., all its predecessor transactions have committed successfully), we can release this meta-data.

Figure 6.3: Transactional states (idle, active, committed, aborted) and their transition conditions.

Figure 6.3 shows the proposed transactional states, idle, active, committed, and aborted, with their transition conditions. A transaction descriptor is idle until it is assigned a code block to execute. The transaction is active as long as it is executing the code block. Upon a conflict with another concurrent transaction of lower age, the transaction is aborted. If the transaction completes its code execution, it performs the validation step and commits its changes, moving to the committed state. A transaction keeps its read-set and write-set in the committed state. Unlike in the classical transactional state machine, a committed transaction may move to the aborted state due to a conflict with a lower-age transaction, or if it had read an address from another aborted transaction (cascade abort). To summarize, the changes required in a TM algorithm to support out-of-order commit of parallel code are the following (a minimal sketch of the resulting transaction descriptor appears after this list):

• Keep the read and write sets, including any acquired locks, of a committed transaction, and release them only when all its predecessor transactions have committed.

• Support cascade aborts between multiple transactions in the committed state that share read- and write-set addresses, taking into account that cycles can exist: when an active transaction with a lower age reads a value from a committed transaction and then tries to update a value written by that committed transaction, both transactions must be aborted.
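The following is a minimal sketch, not Lerna's actual runtime code, of a transaction descriptor that keeps the states of Figure 6.3 and retains its read/write sets after commit so that committed transactions can still be conflicted with and cascade-aborted; all type and member names are hypothetical and the conflict check is deliberately simplified.

#include <cstdint>
#include <unordered_set>

// Sketch only (hypothetical names): a descriptor for out-of-order commit with
// cascade abort.
enum class TxState { Idle, Active, Committed, Aborted };

struct TxDescriptor {
    uint64_t age = 0;                        // position in the program chronological order
    TxState state = TxState::Idle;
    std::unordered_set<uintptr_t> readSet;   // kept even after commit
    std::unordered_set<uintptr_t> writeSet;  // kept even after commit (plus any locks held)

    // True if the other transaction's writes touch anything this one read or wrote.
    bool conflictsWith(const TxDescriptor& other) const {
        for (uintptr_t addr : other.writeSet)
            if (readSet.count(addr) || writeSet.count(addr)) return true;
        return false;
    }

    // A lower-age abort can invalidate this transaction even if it already
    // committed; the abort must then cascade to readers of this write-set.
    void abort() { state = TxState::Aborted; }

    // Called once every transaction with a smaller age has committed: the
    // transaction is now reachable, so its meta-data (and locks) can be dropped.
    void releaseMetadata() { readSet.clear(); writeSet.clear(); }
};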

6.1.2 Transaction Checkpointing

Transactions are defined as a unit of work: all or none. This assumption mandates re-executing the whole transaction body upon a conflict, even if the transaction executed correctly for a portion of its lifetime. In the context of parallelization, restarting transactions can prevent any possible speedup, especially when order must be preserved and the transactions are balanced, with equal processing time (e.g., similar loop iterations). To tackle this problem, we propose transaction checkpointing as a technique that creates multiple checkpoints at selected execution points. Using checkpoints, a transaction acts as if it saved the system state at each checkpoint. A conflicting transaction can select the latest checkpoint at which its execution was still valid and retry execution from there. Checkpoints can be inserted using different techniques:

• Static Checkpointing. Insert new checkpoints at equal portions of the transaction. This is done statically in the code.

• Dependency Checkpointing. Alias analysis and memory dependency analysis provide a best-effort guess about memory address aliasing. A common situation is the may-alias result, which indicates a possible conflict. Placing a checkpoint before may-alias accesses can improve performance: a transaction continues execution when there is no actual alias, and retries execution from right before the invalid access when the alias does occur.

• Periodic Checkpointing. During runtime, we checkpoint the work done at certain times (e.g., after a considerable amount of work has been performed).

• Last Conflicting Address. In our model, transactions execute symmetric code (i.e., loop iterations). Conflicting transactions usually keep conflicting at successive retries. A conflicting address therefore represents a guess for a likely transaction collision and can be used as a hint for placing a checkpoint (i.e., right before accessing this address).

The performance gain from any of these techniques is application- and workload-dependent. Additionally, checkpointing introduces an overhead for managing the checkpoints; unless the work done by the transaction is large enough to outweigh this overhead, it is not recommended. Checkpointing is useful when a transaction finishes its processing by updating a shared data structure (e.g., Fluidanimate, see Section 5.8.3); with checkpointing, the aborted transaction can jump back to the checkpoint preceding the shared access and retry execution from there.

In order to support checkpointing, a transaction must be partially abortable. As mentioned before, TM algorithms use either eager or lazy versioning, through an undo log or a write-buffer, respectively. For eager TM, the undo log needs to store meta-data about each checkpoint (i.e., where it occurs). Whenever a transaction wants to partially roll back, the undo log is used to undo changes up to the last recorded checkpoint in the log (see Figure 6.4b).
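As an illustration of the undo-log (eager) variant, here is a minimal sketch, under our own naming assumptions, of an undo log that records checkpoint markers so a transaction can be rolled back only to its most recent valid checkpoint.

#include <cstdint>
#include <vector>

// Sketch only (hypothetical names): an undo log with checkpoint markers for
// partial abort in an eager (in-place update) TM implementation.
struct UndoEntry {
    uint64_t* addr;      // location that was updated in place
    uint64_t  oldValue;  // value to restore on rollback
};

class CheckpointedUndoLog {
    std::vector<UndoEntry> log;
    std::vector<size_t> checkpoints;  // log length at the time each checkpoint was taken
public:
    void recordWrite(uint64_t* addr, uint64_t oldValue) { log.push_back({addr, oldValue}); }
    void checkpoint() { checkpoints.push_back(log.size()); }

    // Undo every write made after checkpoint k (k = 0 rolls back to the
    // transaction start), keeping the work done before that checkpoint intact.
    void rollbackTo(size_t k) {
        size_t keep = (k == 0) ? 0 : checkpoints[k - 1];
        while (log.size() > keep) {
            *log.back().addr = log.back().oldValue;  // undo the in-place write
            log.pop_back();
        }
        checkpoints.resize(k);                       // later checkpoints are discarded
    }
};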

Figure 6.4: Checkpointing implementation with (a) a write-buffer (a separate write-buffer and read-set per checkpoint) and (b) an undo log (checkpoint markers recorded in the log).

With lazy TM implementations, the write-buffer must be split according to the checkpoints: each checkpoint is associated with a separate write-buffer that stores its changes (see Figure 6.4a). Upon a conflict, the transaction discards the write-buffers of the checkpoints that follow the conflicting location. The drawbacks of this approach are that read operations need to check multiple write-buffers, and that the write-buffers must not overlap. An alternative approach is to treat each checkpoint as a nested transaction and employ closed-nesting techniques for handling the partial rollback. However, supporting closed nesting introduces a considerable overhead to the TM performance and complicates the system design.

Bibliography

[1] Intel Parallel Studio. https://software.intel.com/en-us/intel-parallel-studio-xe.

[2] RSTM: The University of Rochester STM. http://www.cs.rochester.edu/research/synchronization/rstm/.

[3] The LLVM Compiler Infrastructure. http://llvm.org.

[4] A.-R. Adl-Tabatabai, B. T. Lewis, V. Menon, B. R. Murphy, B. Saha, and T. Shpeisman. Compiler and runtime support for efficient software transactional memory. In Proceedings of the 2006 Conference on Programming language design and implementation, pages 26–37, Jun 2006.

[5] M. Agarwal, K. Malik, K. M. Woley, S. S. Stone, and M. I. Frank. Exploiting postdominance for speculative parallelization. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, pages 295–305. IEEE, 2007.

[6] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie. Unbounded transactional memory. In HPCA ’05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 316–327, Washington, DC, USA, 2005. IEEE Computer Society.

[7] T. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. Parallel and Distributed Systems, IEEE Transactions on, 1(1):6–16, Jan. 1990.

[8] M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney. Adaptive optimization in the Jalapeño JVM. In Proceedings of the 15th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA '00, pages 47–65, New York, NY, USA, 2000. ACM.

[9] D. A. Bader and K. Madduri. Design and implementation of the hpcs graph analysis benchmark on symmetric multiprocessors. In High Performance Computing–HiPC 2005, pages 465–476. Springer, 2005.


[10] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimization system. In ACM SIGPLAN Notices, volume 35, pages 1–12. ACM, 2000.

[11] J. Barreto, A. Dragojević, P. Ferreira, R. Filipe, and R. Guerraoui. Unifying thread-level speculation and transactional memory. In Proceedings of the 13th International Middleware Conference, pages 187–207. Springer-Verlag New York, Inc., 2012.

[12] L. Baugh, N. Neelakantam, and C. Zilles. Using hardware memory protection to build a high-performance, strongly atomic hybrid transactional memory. In Proceedings of the 35th International Symposium on Computer Architecture, 2008.

[13] A. Bhowmik and M. Franklin. A general compiler framework for speculative multithreading. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '02, pages 99–108, New York, NY, USA, 2002. ACM.

[14] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pages 72–81, New York, NY, USA, 2008. ACM.

[15] G. Bilardi and K. Pingali. Algorithms for computing the static single assignment form. J. ACM, 50(3):375–425, 2003.

[16] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13:422–426, July 1970.

[17] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, and T. Lawrence. Parallel programming with polaris. Computer, 29(12):78–82, 1996.

[18] C. Blundell, J. Devietti, E. C. Lewis, and M. M. K. Martin. Making the fast case common and the uncommon case simple in unbounded transactional memory. SIGARCH Comput. Archit. News, 35(2):24–34, 2007.

[19] P. Boulet, A. Darte, G.-A. Silber, and F. Vivien. Loop parallelization algorithms: from parallelism extraction to code generation. Parallel Comput., 24:421–444, May 1998.

[20] B. Bradel and T. Abdelrahman. Automatic trace-based parallelization of java programs. In Parallel Processing, 2007. ICPP 2007. International Conference on, page 26, Sept. 2007.

[21] B. J. Bradel and T. S. Abdelrahman. The use of hardware transactional memory for the trace-based parallelization of recursive java programs. In Proceedings of the 7th International Conference on Principles and Practice of Programming in Java, PPPJ '09, pages 101–110, New York, NY, USA, 2009. ACM.

[22] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. Cao Minh, and B. Hertzberg. McRT-STM: a high performance software transactional memory system for a multi-core runtime. In PPOPP, pages 187–197, 2006.

[23] B. Cahoon and K. S. McKinley. Data flow analysis for software prefetching linked data structures in Java. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, PACT '01, pages 280–291, Washington, DC, USA, 2001. IEEE Computer Society.

[24] H. W. Cain, M. M. Michael, B. Frey, C. May, D. Williams, and H. Le. Robust architectural support for transactional memory in the power architecture. In ACM SIGARCH Computer Architecture News, volume 41, pages 225–236. ACM, 2013.

[25] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In IISWC '08: Proceedings of The IEEE International Symposium on Workload Characterization, September 2008.

[26] C. Cao Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis, and K. Olukotun. An effective hybrid transactional memory system with strong isolation guarantees. In Proceedings of the 34th Annual International Symposium on Computer Architecture, Jun 2007.

[27] B. Carlstrom, J. Chung, H. Chafi, A. McDonald, C. Minh, L. Hammond, C. Kozyrakis, and K. Olukotun. Executing Java programs with transactional memory. Science of Computer Programming, 63(2):111–129, 2006.

[28] B. Chan. The umt benchmark code. Lawrence Livermore National Laboratory, Livermore, CA, 2002.

[29] B. Chan and T. Abdelrahman. Run-time support for the automatic parallelization of Java programs. The Journal of Supercomputing, 28(1):91–117, 2004.

[30] M. Chen and K. Olukotun. Test: a tracer for extracting speculative threads. In Code Generation and Optimization, 2003. CGO 2003. International Symposium on, pages 301–312. IEEE, 2003.

[31] M. Chen and K. Olukotun. Test: a tracer for extracting speculative threads. In Code Generation and Optimization, 2003. CGO 2003. International Symposium on, pages 301–312. IEEE, 2003.

[32] M. K. Chen and K. Olukotun. The jrpm system for dynamically parallelizing java programs. In Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on, pages 434–445. IEEE, 2003.

[33] P. Chen, M. Hung, Y. Hwang, R. Ju, and J. Lee. Compiler support for speculative multithreading architecture with probabilistic points-to analysis. In ACM SIGPLAN Notices, volume 38, pages 25–36. ACM, 2003.

[34] J. Choi, M. Gupta, M. Serrano, V. Sreedhar, and S. Midkiff. Escape analysis for Java. ACM SIGPLAN Notices, 34(10):1–19, 1999.

[35] R. A. Chowdhury, P. Djeu, B. Cahoon, J. H. Burrill, and K. S. McKinley. The limits of alias analysis for scalar optimizations. In Compiler Construction, pages 24–38. Springer, 2004.

[36] R. Cytron. Doacross: Beyond vectorization for multiprocessors. In Proc. of 1986 Int’l Conf. on Parallel Processing, 1986.

[37] L. Dagum and R. Menon. Openmp: an industry standard api for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55, 1998.

[38] L. Dalessandro, M. F. Spear, and M. L. Scott. Norec: streamlining stm by abolishing ownership records. In ACM Sigplan Notices, volume 45, pages 67–78. ACM, 2010.

[39] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid transactional memory. In ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, pages 336–346, New York, NY, USA, 2006. ACM.

[40] A. Deutsch. Interprocedural may-alias analysis for pointers: Beyond k-limiting. In ACM SIGPLAN Notices, volume 29, pages 230–241. ACM, 1994.

[41] M. DeVuyst, D. M. Tullsen, and S. W. Kim. Runtime parallelization of legacy code on a transactional memory system. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, pages 127–136. ACM, 2011.

[42] D. Dice, M. Herlihy, D. Lea, Y. Lev, V. Luchangco, W. Mesard, M. Moir, K. Moore, and D. Nussbaum. Applications of the adaptive transactional memory test platform. In Transact 2008 workshop, 2008.

[43] D. Dice, O. Shalev, and N. Shavit. Transactional Locking II. In In Proc. of the 20th Intl. Symp. on Distributed Computing, 2006.

[44] Dice, D. and Shavit, N. What Really Makes Transactions Faster? In Proc. of the 1st TRANSACT 2006 workshop, 2006.

[45] A. Dragojević, R. Guerraoui, and M. Kapalka. Stretching transactional memory. In ACM Sigplan Notices, volume 44, pages 155–165. ACM, 2009.

[46] Z. Du, C. Lim, X. Li, C. Yang, Q. Zhao, and T. Ngai. A cost-driven compilation framework for speculative parallelization of sequential programs. ACM SIGPLAN Notices, 39(6):71–81, 2004.

[47] T. J. Edler von Koch and B. Franke. Limits of region-based dynamic binary parallelization. In ACM SIGPLAN Notices, volume 48, pages 13–22. ACM, 2013.

[48] R. Ennals. Software transactional memory should not be obstruction-free. Technical Report IRC-TR-06-052, Intel Research Cambridge Tech Report, Jan 2006.

[49] K. A. Faigin, S. A. Weatherford, J. P. Hoeflinger, D. A. Padua, and P. M. Petersen. The polaris internal representation. International Journal of Parallel Programming, 22(5):553–586, 1994.

[50] P. Felber, C. Fetzer, and T. Riegel. Dynamic performance tuning of word-based software transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, PPoPP '08, pages 237–246, New York, NY, USA, 2008. ACM.

[51] S. Garcia, D. Jeon, C. M. Louie, and M. B. Taylor. Kremlin: rethinking and rebooting gprof for the multicore age. In ACM SIGPLAN Notices, volume 46, pages 458–469. ACM, 2011.

[52] M. Gonzalez-Mesa, E. Gutierrez, E. L. Zapata, and O. Plata. Effective transactional memory execution management for improved concurrency. ACM Transactions on Architecture and Code Optimization (TACO), 11(3):24, 2014.

[53] R. Guerraoui and M. Kapalka. Opacity: A Correctness Condition for Transactional Memory. Technical report, EPFL, 2007.

[54] M. Gupta, S. Mukhopadhyay, and N. Sinha. Automatic parallelization of recursive procedures. International Journal of Parallel Programming, 28(6):537–562, 2000.

[55] M. Hall, J. Anderson, S. Amarasinghe, B. Murphy, S. Liao, and E. Bu. Maximizing multiprocessor performance with the suif compiler. Computer, 29(12):84–89, 1996.

[56] L. Hammond, M. Willey, and K. Olukotun. Data speculation support for a chip multiprocessor, volume 32. ACM, 1998.

[57] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional memory coherence and consistency. In in Proc. of ISCA, page 102, 2004.

[58] T. Harris and K. Fraser. Language support for lightweight transactions. ACM SIGPLAN Notices, (38), 2003.

[59] T. Harris, J. Larus, and R. Rajwar. Transactional Memory, 2nd edition. Synthesis Lectures on Computer Architecture, 5(1):1–263, 2010.

[60] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy. Composable memory transactions. In PPoPP '05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 48–60, New York, NY, USA, 2005. ACM.

[61] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy. Composable memory transactions. In PPoPP '05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 48–60, New York, NY, USA, 2005. ACM.

[62] D. Heath, R. Jarrow, and A. Morton. Bond pricing and the term structure of interest rates: A new methodology for contingent claims valuation. Econometrica: Journal of the Econometric Society, pages 77–105, 1992.

[63] J. L. Henning. Spec cpu2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006.

[64] M. Herlihy. The art of multiprocessor programming. In PODC '06: Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, pages 1–2, New York, NY, USA, 2006. ACM.

[65] M. Herlihy, V. Luchangco, and M. Moir. A flexible framework for implementing software transactional memory. Volume 41, pages 253–262, New York, NY, USA, October 2006. ACM.

[66] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer. Software transactional memory for dynamic-sized data structures. In Proceedings of the 22nd Annual ACM Symposium on Principles of Distributed Computing, pages 92–101. ACM Press, 2003.

[67] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 289–300, 1993.

[68] S. S. Huang, A. Hormati, D. F. Bacon, and R. M. Rabbah. Liquid metal: Object-oriented programming across the hardware/software boundary. In J. Vitek, editor, ECOOP 2008 - Object-Oriented Programming, 22nd European Conference, Paphos, Cyprus, July 7-11, 2008, Proceedings, volume 5142 of Lecture Notes in Computer Science, pages 76–103. Springer, 2008.

[69] G. C. Hunt, M. M. Michael, S. Parthasarathy, and M. L. Scott. An efficient algorithm for concurrent priority queue heaps. Inf. Process. Lett., 60:151–157, November 1996.

[70] W. M. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery. The superblock: An effective technique for vliw and superscalar compilation. The Journal of Supercomputing, 7:229–248, 1993. 10.1007/BF01205185.

[71] T. Johnson. Characterizing the performance of algorithms for lock-free objects. Computers, IEEE Transactions on, 44(10):1194–1207, Oct. 1995.

[72] D. Kanter. Analysis of Haswell's transactional memory. Real World Technologies (February 15, 2012), 2012.

[73] N. Karjanto, B. Yermukanova, and L. Zhexembay. Black-Scholes equation. arXiv preprint arXiv:1504.03074, 2015.

[74] G. Korland, N. Shavit, and P. Felber. Noninvasive concurrency with Java STM. In Third Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG), 2010.

[75] V. Krishnan and J. Torrellas. A chip-multiprocessor architecture with speculative multithreading. Computers, IEEE Transactions on, 48(9):866–880, 1999.

[76] V. P. Krothapalli and P. Sadayappan. An approach to synchronization for parallel computing. In Proceedings of the 2nd international conference on Supercomputing, pages 573–581. ACM, 1988.

[77] S. Kumar, M. Chu, C. J. Hughes, P. Kundu, and A. Nguyen. Hybrid transactional memory. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '06, pages 209–220, New York, NY, USA, 2006. ACM.

[78] M. Lam and M. Rinard. Coarse-grain parallel programming in Jade. In ACM SIGPLAN Notices, volume 26, pages 94–105. ACM, 1991.

[79] J. R. Larus and R. Rajwar. Transactional Memory. Morgan and Claypool, 2006.

[80] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.

[81] Z. Li, A. Jannesari, and F. Wolf. Discovery of potential parallelism in sequential programs. In Parallel Processing (ICPP), 2013 42nd International Conference on, pages 1004–1013. IEEE, 2013.

[82] W. Liu, J. Tuck, L. Ceze, W. Ahn, K. Strauss, J. Renau, and J. Torrellas. POSH: a TLS compiler that exploits program structure. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 158–167. ACM, 2006.

[83] W. Liu, J. Tuck, L. Ceze, W. Ahn, K. Strauss, J. Renau, and J. Torrellas. POSH: a TLS compiler that exploits program structure. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 158–167. ACM, 2006.

[84] M. G. Main. Detecting leftmost maximal periodicities. Discrete Appl. Math., 25:145– 153, September 1989.

[85] J. Mankin, D. Kaeli, and J. Ardini. Software transactional memory for multicore embedded systems. SIGPLAN Not., 44(7):90–98, 2009.

[86] V. J. Marathe, M. F. Spear, C. Heriot, A. Acharya, D. Eisenstat, W. N. S. III, and M. L. Scott. Lowering the overhead of nonblocking software transactional memory. Workshop on Languages, Compilers, and Hardware Support for Transactional Computing (TRANSACT), June 2006.

[87] M. Mehrara, J. Hao, P.-C. Hsu, and S. Mahlke. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI '09, pages 166–176, New York, NY, USA, 2009. ACM.

[88] M. Mehrara, J. Hao, P.-C. Hsu, and S. Mahlke. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI '09, pages 166–176, New York, NY, USA, 2009. ACM.

[89] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, PODC ’96, pages 267–275, New York, NY, USA, 1996. ACM.

[90] M. Moir. Practical implementations of non-blocking synchronization primitives. In In Proc. of 16th PODC, pages 219–228, 1997.

[91] K. E. Moore. Thread-level transactional memory. In Wisconsin Industrial Affiliates Meeting, Oct. 2004.

[92] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood. LogTM: Log- based transactional memory. In In Proc. 12th Annual International Symposium on High Performance Computer Architecture, 2006.

[93] J. E. B. Moss. Open nested transactions: Semantics and support. In In Workshop on Memory Performance Issues,, 2005.

[94] J. E. B. Moss, N. D. Griffeth, and M. H. Graham. Abstraction in recovery management. In ACM SIGMOD Record, volume 15, pages 72–83. ACM, 1986.

[95] J. E. B. Moss and A. L. Hosking. Nested transactional memory: model and architecture sketches. Sci. Comput. Program., 63:186–201, December 2006.

[96] M. Müller, D. Charypar, and M. Gross. Particle-based fluid simulation for interactive applications. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '03, pages 154–159, Aire-la-Ville, Switzerland, 2003. Eurographics Association.

[97] S. C. Müller, G. Alonso, A. Amara, and A. Csillaghy. Pydron: Semi-automatic parallelization for multi-core and the cloud. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 645–659, Broomfield, CO, Oct. 2014. USENIX Association.

[98] MySQL AB. MySQL: the world's most popular open source database, 1995.

[99] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose, C. Zissulescu, and E. Deprettere. Daedalus: toward composable multimedia mp-soc design. In Proceedings of the 45th annual Design Automation Conference, pages 574–579. ACM, 2008.

[100] NVIDIA. Compute unified device architecture programming guide, 2007.

[101] C. D. Polychronopoulos. Compiler optimizations for enhancing parallelism and their impact on architecture design. Computers, IEEE Transactions on, 37(8):991–1004, 1988.

[102] M. K. Prabhu and K. Olukotun. Using thread-level speculation to simplify manual parallelization. In Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’03, pages 1–12, New York, NY, USA, 2003. ACM.

[103] C. Quiñones, C. Madriles, J. Sánchez, P. Marcuello, A. González, and D. Tullsen. Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices. In ACM Sigplan Notices, volume 40, pages 269–279. ACM, 2005.

[104] A. Raman, H. Kim, T. R. Mason, T. B. Jablin, and D. I. August. Speculative parallelization using software multi-threaded transactions. In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ASPLOS '10, pages 65–76, New York, NY, USA, 2010. ACM.

[105] A. Raman, H. Kim, T. R. Mason, T. B. Jablin, and D. I. August. Speculative parallelization using software multi-threaded transactions. In ACM SIGARCH Computer Architecture News, volume 38, pages 65–76. ACM, 2010.

[106] R. Ramaseshan and F. Mueller. Toward thread-level speculation for coarse-grained parallelism of regular access patterns. In Workshop on Programmability Issues for Multi-Core Computers, page 12, 2008.

[107] L. Rauchwerger and D. Padua. The lrpd test: speculative run-time parallelization of loops with privatization and reduction parallelization. SIGPLAN Not., 30:218–232, June 1995.

[108] T. Riegel, P. Felber, and C. Fetzer. A lazy snapshot algorithm with eager validation. In DISC 2006, volume 4167 of Lecture Notes in Computer Science, pages 284–298. Springer, Sep 2006.

[109] T. Riegel, P. Felber, and C. Fetzer. A lazy snapshot algorithm with eager validation. In S. Dolev, editor, Distributed Computing, Lecture Notes in Computer Science, pages 284–298. Springer Berlin / Heidelberg, 2006.

[110] T. Riegel, C. Fetzer, H. Sturzrehm, and P. Felber. From causal to z-linearizable transactional memory. In Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, PODC '07, pages 340–341, New York, NY, USA, 2007. ACM.

[111] C. J. Rossbach, Y. Yu, J. Currey, J. Martin, and D. Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In M. Kaminsky and M. Dahlin, editors, ACM SIGOPS 24th Symposium on Operating Systems Principles, SOSP ’13, Farmington, PA, USA, November 3-6, 2013, pages 49–68. ACM, 2013.

[112] E. Rotenberg and J. Smith. Control independence in trace processors. In Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture, pages 4–15. IEEE Computer Society, 1999.

[113] R. Rugina and M. Rinard. Automatic parallelization of divide and conquer algorithms. In ACM SIGPLAN Notices, volume 34, pages 72–83. ACM, 1999.

[114] G. Rünger and M. Schwind. Parallelization strategies for mixed regular-irregular applications on multicore-systems. In Advanced Parallel Processing Technologies, pages 375–388. Springer, 2009.

[115] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. Cao Minh, and B. Hertzberg. McRT-STM: a high performance software transactional memory system for a multi-core runtime. In PPoPP '06, pages 187–197, Mar 2006.

[116] B. Saha, A.-R. Adl-Tabatabai, and Q. Jacobson. Architectural support for software transactional memory. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 185–196, Washington, DC, USA, 2006. IEEE Computer Society.

[117] J. H. Saltz, R. Mirchandaney, and K. Crowley. The preprocessed doacross loop. In ICPP (2), pages 174–179, 1991.

[118] W. N. Scherer III and M. L. Scott. Contention management in dynamic software transactional memory. In PODC ’04: Proceedings of Workshop on Concurrency and Synchronization in Java Programs., NL, Canada, 2004. ACM.

[119] N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing, PODC ’95, pages 204–213, New York, NY, USA, 1995. ACM.

[120] M. Snir. MPI–the Complete Reference: The MPI core, volume 1. MIT press, 1998.

[121] G. S. Sohi, S. E. Breach, and T. Vijaykumar. Multiscalar processors. In ACM SIGARCH Computer Architecture News, volume 23, pages 414–425. ACM, 1995.

[122] M. Spear, K. Kelsey, T. Bai, L. Dalessandro, M. Scott, C. Ding, and P. Wu. Fastpath speculative parallelization. Languages and Compilers for Parallel Computing, pages 338–352, 2010.

[123] J. Steffan and T. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. In High-Performance Computer Architecture, 1998. Proceedings., 1998 Fourth International Symposium on, pages 2–13. IEEE, 1998.

[124] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A scalable approach to thread-level speculation, volume 28. ACM, 2000.

[125] K. Streit, C. Hammacher, A. Zeller, and S. Hack. Sambamba: runtime adaptive parallel execution. In Proceedings of the 3rd International Workshop on Adaptive Self-Tuning Computing Systems, page 7. ACM, 2013.

[126] W. Thies, V. Chandrasekhar, and S. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in c programs. In Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pages 356–369. IEEE, 2007.

[127] J. Tsai and P. Yew. The superthreaded architecture: Thread pipelining with run-time data dependence checking and control speculation. In Parallel Architectures and Compilation Techniques, 1996., Proceedings of the 1996 Conference on, pages 35–46. IEEE, 1996.

[128] N. A. Vachharajani. Intelligent speculation for pipelined multithreading. PhD thesis, Princeton, NJ, USA, 2008. AAI3338698.

[129] H. Vandierendonck, S. Rul, and K. De Bosschere. The paralax infrastructure: automatic parallelization with a helping hand. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 389–400. ACM, 2010.

[130] C. von Praun, R. Bordawekar, and C. Cascaval. Modeling optimistic concurrency using quantitative dependence analysis. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, pages 185–196. ACM, 2008.

[131] P. Wu, A. Kejariwal, and C. Caşcaval. Compiler-driven dependence profiling to guide program parallelization. Languages and Compilers for Parallel Computing, pages 232–248, 2008.

[132] W. A. Wulf and S. A. McKee. Hitting the memory wall: implications of the obvious. ACM SIGARCH computer architecture news, 23(1):20–24, 1995.

[133] C. Zilles and G. Sohi. Master/slave speculative parallelization. In Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture, MICRO 35, pages 85–96, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.

[134] H. P. Zima, H.-J. Bast, and M. Gerndt. Superb: A tool for semi-automatic mimd/simd parallelization. Parallel computing, 6(1):1–18, 1988.