Implicit Parallelism with Ordered Transactions

Christoph von Praun    Luis Ceze    Călin Caşcaval

IBM T.J. Watson Research Center
{praun, cascaval}@us.ibm.com

Department of Computer Science
University of Illinois at Urbana-Champaign
[email protected]

Abstract

Implicit Parallelism with Ordered Transactions (IPOT) is an extension of sequential or explicitly parallel programming models to support speculative parallelization. The key idea is to specify opportunities for parallelization in a sequential program using annotations similar to transactions. Unlike explicitly parallel constructs, IPOT annotations do not require the absence of data dependence, since the parallelization relies on runtime support for speculative execution. IPOT as a parallel programming model is determinate, i.e., program semantics are independent of the task scheduling. For performance optimization, non-determinism can be introduced selectively.

We describe the programming model of IPOT and an online tool that recommends boundaries of ordered transactions by observing a sequential execution. On three example HPC workloads we demonstrate that our method is effective in identifying opportunities for fine-grain parallelization. Using the automated task recommendation tool, we were able to perform the parallelization of each program within a few hours.

Categories and Subject Descriptors D.1.3 [Software]: Concurrent Programming; C.1.4 [Processor Architectures]: Parallel Architectures

General Terms Design, Languages, Performance

Keywords parallel programming, program parallelization, implicit parallelism, thread-level speculation, transactional memory, ordered transactions

1. Introduction

The current trend in processor architecture is a move toward multicore chips. The reasons are multiple: the number of available transistors is increasing, while the power budget for superscalar processors is limiting frequency and thus serial execution performance [2, 24]. Parallel execution will be required to leverage the performance potential of upcoming processor generations. The key challenge posed by this trend is to simplify parallel programming technology.

One approach is automatic program parallelization. A compiler analyzes the source code and extracts parallel loops. The main advantage of this approach is that users do not get involved, and, at least theoretically, one can convert legacy sequential applications to exploit multicore parallelism. However, after decades of research, automatic parallelization works only for regular loops in scientific codes. Workloads for multicore systems include other classes of applications for which automatic parallelization has been traditionally unsuccessful [14, 30].

A second approach is to specify parallelism explicitly. Multithreading is a popular paradigm, since it naturally maps onto shared memory multiprocessors and can be regarded as an extension of sequential programming. Nevertheless, this paradigm is problematic since it invites new and subtle errors due to synchronization defects [18, 6], which are difficult to debug. Besides correctness concerns, explicitly parallel programs commonly suffer from performance and portability problems.

The third approach is to simplify parallelization through speculative execution. The most popular types of such systems are Thread Level Speculation (TLS) [15, 27] and Transactional Memory (TM) [10]. Their main characteristic is the ability to execute sections of code in a 'sand box' where updates of memory are buffered and memory access is checked for conflicts among concurrent threads. There are several proposals that implement TLS in hardware and provide compiler support to extract speculative tasks automatically. While this technique allows parallel execution, it does not come for free: hardware support for speculation can be quite complex [25], and speculative multitasking is typically less efficient than the execution of explicitly parallel programs [13].

In this paper we present an approach that integrates mechanisms for speculative parallelization into a programming language. In this environment, called Implicit Parallelism with Ordered Transactions (IPOT), programmers can quickly convert sequential applications or scale explicitly parallel programs to take advantage of multicore architectures. IPOT provides constructs to specify opportunities for speculative parallelization and semantic annotations for program variables. The programming environment, i.e., compiler, runtime system, and hardware, can take advantage of this information to choose and schedule speculative tasks and to optimize the management of shared data. We require that the execution environment supports speculative execution. The focus of this paper is on the IPOT programming model and tools; however, we briefly discuss the required architecture and compiler support.

IPOT borrows from TM, since units of parallel work execute under atomicity and isolation guarantees. Moreover, IPOT inherits commit ordering from TLS, hence ordered transactions. The key idea is that ordering enables sequential reasoning for the programmer without precluding concurrency (in the common case) on a runtime platform with speculative execution.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PPoPP'07 March 14-17, 2007, San Jose, California, USA.
Copyright (c) 2007 ACM 978-1-59593-602-8/07/0003 $5.00

Figure 1. Simple loop to execute an ordered sequence of tasks.

Figure 2. Explicitly parallel task sequence ordered through conditional critical sections.

Figure 3. Ordered task sequence with IPOT.

To summarize, this paper makes the following contributions:

- a parallel programming model, IPOT, that extends sequential or parallel programming languages with support for speculative multithreading and transactional execution;
- a set of constructs that can selectively relax the determinacy properties of an IPOT program and improve execution performance;
- an algorithm that infers task recommendations from the execution of a sequential program;
- a tool that estimates the execution speedup of a sequential program annotated with IPOT directives.

2. Programming model

2.1 Example

The goal of IPOT is to simplify the parallelization of a sequential thread of execution. To motivate our approach we use the generic example of a sequential loop that executes a series of tasks T_i (Figure 1). We consider the case where the tasks T_i may have (carried) data dependences, hence concurrency control must be used to coordinate the execution of the tasks if the loop is parallelized. We employ features from the X10 programming language [4] to illustrate the parallelization and assume transactional memory as the mechanism for concurrency control.

Parallelization with explicit concurrency

In the program in Figure 2, the finish/foreach construct achieves a fork-join parallelization of the loop such that all iterations may proceed concurrently. The original body of the loop is executed inside a transaction (atomic, when). Commit ordering of these transactions is achieved indirectly through a conditional critical region that enforces dependences on the variables of array done[]. A similar example is given in [3, Figure 7]. Notice that most common implementations of conditional critical regions would serialize the computation and not overlap the execution of subsequent iterations. This example demonstrates that the loss of task ordering information due to explicit parallelization can be recovered through complex and expensive synchronization in the program and runtime system. Apart from the code complexity, this high-level synchronization can negatively affect performance.

Parallelization with implicit concurrency

The parallelization of the sequential loop using IPOT is shown in Figure 3. The IPOT execution model requires that the semantics of this program are equivalent to the serial execution, regardless of data dependences between the tasks T_i and the enclosing program context. Given the mechanisms described in this paper, this program can execute efficiently, i.e., execution is parallel if there are no dependences, and serial if there are dependences.

IPOT enables ordering constraints to be tracked by the execution platform and thus facilitates highly efficient synchronization among tasks. Beyond ordering, Section 2.2 illustrates how IPOT can avoid and resolve conflicts among transactions that are data dependent.

2.2 Language extensions

IPOT can be embedded in a sequential or parallel programming language with the language extensions described below. These extensions support the compiler and runtime in achieving a good parallelization and successful speculation. This section discusses annotations for the speculative parallelization of sequential programs. Section 2.7 introduces IPOT extensions for non-speculative parallelization.

IPOT constructs for speculative parallelization have the nature of hints, i.e., any single annotation can be left out without altering program semantics. Conversely, most IPOT annotations do not affect the program semantics in a way that deviates from the sequential execution; the exceptions are discussed in detail. We believe that these properties make IPOT attractive and simple to use in practice.

Transaction boundaries

We use tryasync { stmts } to denote a block of code whose execution is likely data independent from the serial execution context in which it is embedded. An execution instance of the statement block is called a "speculative task", or task for short. Since tasks may execute concurrently, we refer to them as ordered transactions. Since we consider sequential programs in this section, the term 'transaction' refers to the execution principle and not to a mechanism for explicit concurrency control. Program execution is done on a runtime substrate that executes tasks speculatively and guarantees that the observable effects follow the sequential execution order (isolation) and that no partial effects of a task are visible to concurrent observers (atomicity). The latter aspect is only relevant when tryasync blocks are used in conjunction with non-speculative parallelization (Section 2.7).

The program structure due to tryasync blocks resembles a master-slave parallelization in OpenMP. The commit order of tasks is specified implicitly by the control flow of the master. tryasync blocks can be nested. A possible implementation of a nested tryasync is flattening, i.e., to simply replace it with its statement block.

Conflict avoidance and resolution

Since tasks corresponding to tryasync execute concurrently, they may access common program variables in conflicting ways. As a result, tasks that observe the memory in a state which is inconsistent with serial execution semantics have to be squashed and re-executed. Hence violations of data flow dependence are a potential source of inefficiency and should be avoided if possible.

The following attributes for variables in the serial program support the compiler and runtime system in determining the scope of access as well as the dependence and sharing patterns when speculative tasks execute concurrently. The variable attributes enable an optimized runtime organization, e.g., variable privatization through the compiler or hardware.
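The ordered-transaction principle (optimistic execution, in-order commit, squash and re-execution on a violated flow dependence) can be emulated in a few lines. The sketch below is our illustration, not part of the paper; `run_ordered` and the read/write callbacks are invented names, and the model deliberately ignores write-forwarding and partial commits.

```python
def run_ordered(tasks, memory):
    """Emulate ordered transactions: every task first runs speculatively
    against the initial memory (as if all tasks started at once); at commit
    time, a task whose reads turned stale is squashed and re-executed."""
    def execute(task, state):
        reads, writes = {}, {}
        def rd(k):
            return reads.setdefault(k, state.get(k))  # record first read
        def wr(k, v):
            writes[k] = v                             # buffer the write
        task(rd, wr)
        return reads, writes

    speculative = [execute(t, memory) for t in tasks]  # optimistic start
    squashes = 0
    for task, (reads, writes) in zip(tasks, speculative):
        if any(memory.get(k) != v for k, v in reads.items()):
            squashes += 1                              # flow dependence violated
            reads, writes = execute(task, memory)      # squash and re-execute
        memory.update(writes)                          # in-order commit
    return memory, squashes
```

In this model, three tasks that each increment a shared counter all start from the same initial state; the second and third are squashed at commit time and re-executed, reproducing the sequential result, while tasks that touch disjoint data commit without any squash.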

We define six categories for program variables. Variable classification shall be done at compile time, through annotations supplied by the programmer or through program analysis.

final variables are initialized once in the master and are subsequently read-only. Strongly typed programming languages with built-in explicit concurrency follow this model: Java prevents a child thread from accessing the stack of its parent thread; X10 restricts the access of a child activity to final variables in the stack of the parent activity. final variables are not subject to conflicts.

private variables are initialized and then used in the scope of the same transaction. private variables act as temporaries and are not subject to conflicts.

induction variables are modified only by the master and accessed read-only by speculative tasks.

reduction The operations available on reduction variables are limited to associative operations (e.g., +, *, max, min, ...), and for each reduction variable, only one reduction operator must be used. Read and update due to the reduction operator occur atomically, but the serialization order of operations in different tasks is unspecified. A reduction declaration does typically not affect determinacy (Section 2.4).

race Read operations on race variables are exempted from conflict detection, i.e., an update by a transaction that precedes the current transaction in the commit order will not lead to a squash even though the current transaction may have read a stale value. However, should a transaction decide to update the variable, all prior reads must have returned values consistent with the serialization order. Access to race variables follows the atomicity guarantee of the enclosing transaction, i.e., updates are visible only after successful commit. In the read-only case, we do not require strong isolation for such variables, i.e., reads do not have to be repeatable. Race variables can be useful when parallelizing algorithms where the computation of a task depends on a global condition that can be read independently of the task ordering, e.g., a global monotonic cut-off boundary in branch and bound algorithms such as the Traveling Salesman Problem.

ordinary variables are those that do not fall into any other category. These variables are subject to the normal versioning and conflict detection and can serve to communicate values among transactions along the commit order.

The communication among tasks through ordinary variables can be specified explicitly through a flow annotation, which is an attribute of a read operation. A flow read blocks until the immediate predecessor transaction updates the variable or all transactions preceding the current transaction in the commit order have terminated. Serial semantics demands that a flow read returns the value of the most recent update of the variable in the serial program order. At the implementation level, flow can be used to optimize value communication among tasks [29]. Alternatively to associating a flow annotation with a read operation, variables could be designated as flow such that the first read of that variable in a task would expect communication with the predecessor task.

Using the aforementioned mechanisms for conflict avoidance and detection, a programmer with perfect knowledge about the data dependences can rule out misspeculation entirely, as illustrated in the following example.

2.3 Example

Figure 4 shows a finite difference stencil computation using the Gauss-Seidel method. Apart from the IPOT directives (bold), the structure of the computation corresponds to the sequential version of the algorithm.

Figure 4. Wavefront computation implemented with IPOT.

The parallelization strategy chosen here considers the update of each row of the matrix as a separate task (wavefront). This is denoted by the tryasync annotation before the inner loop. Given only this annotation, the speculative execution will likely not be successful due to data dependences across tasks. Additional annotations control the scheduling of tasks: data dependences on sigma are resolved through the reduction declaration, i.e., reads and updates occur atomically irrespective of the transaction order. Read access to variables with flow dependences may delay a transaction if necessary to enforce the dependence with its predecessor in the sequential order. The variables old and val are private.

In this example, the IPOT annotations capture all cross-task data dependences that are potential sources of misspeculation, and hence it is guaranteed that no conflicts will occur at runtime. This means that IPOT is capable of supporting well-known parallelization schemes that do not require speculation (in this case wavefront) in an efficient way. The example illustrates that parallelization with IPOT preserves the algorithmic structure and separates it from the parallelization aspect. Compared to explicitly parallel versions of stencil computations that follow wavefront or red-black strategies, IPOT compares favorably in code complexity and avoids the extra code needed for data decomposition and synchronization in explicit parallel programming models.

2.4 Determinacy

A parallel program is externally determinate [5] (determinate for short) if its output depends only on the input, not on the scheduling and progress of concurrent tasks.

One of the strengths of the IPOT programming model is that determinacy is preserved when evolving from a sequential to a parallel version of a program. The reason for determinacy is that data races [18] that may occur due to the parallelization are automatically detected and resolved by the system: whenever two tasks participate in a data race such that a flow dependence between an 'older' and a 'younger' task (in the commit order) is not resolved correctly, i.e., the younger task reads a stale value, the younger task and its successor tasks are squashed and restarted. This behavior makes the system determinate.
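The listing of Figure 4 is garbled in this extraction. As a stand-in, the following sequential sketch is our reconstruction of a Gauss-Seidel sweep, with the places where the IPOT annotations discussed above would apply marked as comments; the names `a`, `sigma`, `old`, and `val` are illustrative.

```python
def gauss_seidel_sweep(a):
    """One in-place Gauss-Seidel sweep over grid 'a'; returns residual sigma."""
    sigma = 0.0                         # reduction variable: associative '+='
    n, m = len(a), len(a[0])
    for i in range(1, n - 1):           # tryasync: each row is one ordered task
        for j in range(1, m - 1):
            old = a[i][j]               # 'old' and 'val' are private temporaries
            val = 0.25 * (a[i - 1][j]   # flow read: row i-1 is produced by the
                          + a[i + 1][j] #   predecessor task in commit order
                          + a[i][j - 1] + a[i][j + 1])
            a[i][j] = val
            sigma += abs(val - old)     # reduction update, order-insensitive
    return sigma
```

Run on a grid with fixed boundary values, repeated sweeps drive the residual down; in the wavefront parallelization, the flow reads of row i-1 are what order the row tasks against each other.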

Figure 5. Comparison of doall parallelism in OpenMP with IPOT extensions.

Figure 6. Explicitly parallel tasks with ordered communication through a nested tryasync block.

Two IPOT annotations allow the programmer to selectively introduce internal non-determinacy [5]: reduction and race. Internal non-determinacy means that intermediate states encountered during program execution may differ between program runs due to different task schedules. However, the final result is the same as that of the sequential computation. Internal non-determinacy enhances the flexibility of task scheduling and can be regarded as an optimization.

A reduction declaration preserves external determinacy, since tasks are only allowed to perform associative update operations on such variables (associative non-determinacy [5]). Floating point arithmetic is a corner case, since implementations can violate the associativity of algebraic operations. In such a case, a reduction declaration can be a source of external non-determinism. The same holds for race variables, where a read may return any prior value, not necessarily the most recent one in the commit order. It is the programmer's responsibility to use such variables in a way that preserves external determinacy.

Given that IPOT programs are principally determinate, many issues that complicate the semantics of explicitly parallel programming languages do not arise, such as shared memory consistency [23, 17], the semantics of inlining threads [17], and exception handling [4].

2.5 Handling of overflow, I/O, and system calls

IPOT can serialize task execution in the event of an overflow of speculative buffers or to execute operations with permanent side effects. In such cases, a task blocks until all preceding tasks have committed. Then, the task continues the remainder of its computation in a non-speculative manner.

2.6 Software development

IPOT facilitates the software development process since it separates the aspects of algorithm development and performance engineering. A domain expert specifies an algorithm and possible "units of independent work" in a sequential logic. Subsequently, a performance expert annotates (with the help of the taskfinder tool presented in Section 3) and restructures the program using the language extensions presented in Section 2.2 to achieve an efficient parallel version of the code. We consider this separation of concerns an important advantage of IPOT.

2.7 Extension for explicit parallelism

A natural extension of IPOT is to permit non-speculative parallelization with tasks that are not subject to the rigorous conflict detection and the predefined serialization order. This section describes a mechanism that relaxes the total commit order of tasks to a partial commit order. Not surprisingly, this extension is a potential source of non-determinism.

We use async { stmts } to denote a block of code whose execution is known to be data independent from the serial execution context in which it is embedded. The execution of such a block can proceed concurrently with its continuation. Unlike tryasync blocks, the execution is not transactional. async is an annotation that specifies data independence between a block of code and its continuation. async is not meant as a directive for general multithreading, and hence IPOT supports only a simple form of explicit, non-blocking concurrency control to enable communication among concurrent tasks. Concurrent tasks shall not make any assumptions about the order in which such communication occurs. A permissible implementation of async is flattening, i.e., an async statement can be replaced with its statement block. async blocks can be nested.

finish { stmts } acts like a barrier and blocks until the execution of the statement block and all tasks in this block have terminated. Figure 5 illustrates and compares these constructs to an OpenMP parallel loop.

In the presence of explicit parallelism (async), the role of tryasync is not merely a hint for parallelization. The transactional execution properties of tryasync blocks can be leveraged to achieve concurrency control among unordered tasks. This is illustrated in Figure 6, where the async and tryasync constructs are used in conjunction. This program fragment is adopted from a critical loop in [19]. The iteration traverses an array of coordinates and places positions with a distance below a certain cutoff threshold in a result array. Although the serial version of this code maintains the original order of the coordinates in the result array, this is not required by the algorithm. Apart from the access to the shared index variable, the loop iterations are independent and hence are specified as async tasks. Since concurrent reads and updates of the index variable must be ordered, these accesses are specified as a transaction using the tryasync construct. Note that the serialization order of the tryasync blocks nested in different async instances is not predetermined.

More generally, tryasync blocks that are nested within async blocks adhere to a partial commit order as follows: tryasync blocks in the scope of the same async task commit in sequence, while tryasync blocks in the scope of different async tasks can commit in any order.

The async feature is a potential source of non-determinism if concurrent async tasks communicate through shared memory in an unordered manner (data race) or through tryasync transactions, where the commit order is not predetermined.

This explicitly parallel extension of the IPOT programming model is compatible with locks: similar to the handling of I/O and system calls, the non-speculative serial execution order always provides a safe fall-back execution model that can handle blocking synchronization constructs such as locks. Moreover, since IPOT's model of concurrency control is based on transactions, not locks, its features will not introduce deadlocks.

Figure 8. Taskfinder algorithm.
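The listing of Figure 6 is likewise not legible in this extraction. The following serial sketch is our reconstruction of such a cutoff filter; `within_cutoff`, `result`, and `count` are our own names, and the async/tryasync placement is indicated only in comments.

```python
import math

def within_cutoff(coords, center, cutoff):
    """Serial cutoff filter; the parallel version would mark each iteration
    as an async task and the counter update as a nested tryasync."""
    result = []
    count = 0                 # shared index: its read-modify-write would be
    for p in coords:          # wrapped in tryasync inside each async task
        # async { ... }  -- iterations are otherwise independent
        if math.dist(p, center) < cutoff:
            # tryasync { result[count] = p; count += 1 }  -- atomic, but the
            # commit order across different async tasks is unspecified, so
            # the order of entries in 'result' is not guaranteed in parallel
            result.append(p)
            count += 1
    return result, count
```

The serial version preserves the input order of the surviving coordinates; as the text notes, the algorithm itself does not require that order, which is exactly why the partial commit order of nested tryasync blocks suffices here.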

3. Task recommendation

In the previous section we introduced language annotations to specify independent units of work in existing sequential or parallel programs. In this section, we illustrate how a programmer can be assisted in conceiving such annotations and describe how task recommendations can be derived from a program execution.

Task recommendation identifies sections of code that are attractive candidates for units of speculative work. Two aspects determine the potential of a task to speed up a parallel program: the fraction of the sequential execution spent in this section (size) and the potential overlap with preceding parts of the program (hoist). We devise a taskfinder algorithm to compute these characteristics.

In the execution model underlying the taskfinder algorithm, a task is characterized by a single program counter, called the taskhead: a task extends from one occurrence of the taskhead in the program execution to the next. The result of the taskfinder algorithm is a set of taskheads. The algorithm computes for each taskhead a number of properties that provide guidance on selecting 'good' taskheads for parallelization. These properties are: the average size of a task, the average distance of the closest data dependence, and the potential speedup of the overall program.

The algorithm operates on a dynamic execution trace. In our implementation, this trace is collected in bursts of dynamic basic blocks obtained through dynamic binary instrumentation [16]. Figure 7 illustrates a sequence of dynamic basic blocks with dependence edges. The taskfinder algorithm is shown in Figure 8 and proceeds in two steps: first, information from the dynamic execution stream is sampled into the corresponding static basic blocks (Figure 9); then, speedup estimates are computed for a potential taskhead at the start of each static basic block (Figure 10).

Figure 7. Illustration of the taskfinder algorithm.

Figure 9. Sampling of dynamic to static basic blocks.

Sampling The sampling sweeps in steps s_i over the sequence of dynamic basic blocks b_i as illustrated in Figure 7. The static basic block corresponding to a dynamic instance is determined according to the program counter. Two values associated with the static basic block are sampled into histograms. The first is the self length, i.e., the number of instructions that have executed since the last encounter of the same program counter. In the example, the self length sampled in step s6 for basic block b4 is the number of instructions in b4 and b5. The self length value is an indication of the task size.

The second is the dependence length, the distance of the closest data dependence d_i that the current or a subsequent dynamic basic block has to a definition in some preceding basic block. The dependence length is an indication of the hoisting potential and is determined from the set of edges that 'cross' the current program counter at step s_i. The length is reported as a number of instructions. For example, in step s4, the closest data dependence is d2. We consider only flow dependences on heap-allocated data. Our model assumes that writes are effected at the end and reads occur at the beginning of a dynamic basic block.

Speedup estimation For every static basic block, a speedup estimate is computed that indicates the effect of the parallelization achieved by placing a taskhead at the beginning of that basic block (Figure 10). The estimate is computed from the average of the 90th percentile of the samples of the self length and the dependence distance. The 90th percentile is chosen to exclude outliers, e.g., the first and last iterations of a loop. The maximum degree of parallelism (max_par) that can be achieved is limited by the number of times a certain taskhead is encountered and by the number of available processors. We distinguish two forms of parallelism: doall and doacross. Doall is assumed if the dependence length is greater than the task size; the speedup is then estimated according to Amdahl's Law with the maximum degree of parallelism. In the case of doacross parallelism, we assume that the execution of subsequent tasks is overlapped as tightly as permitted by the closest data dependence.
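The pseudocode of Figures 8 and 9 is not legible in this extraction. The following sketch is our illustration of the sampling pass as described: it sweeps over a trace of dynamic basic blocks, records self lengths per static block, and records the distance to the closest crossing flow dependence. The trace format `(pc, n_instrs, reads, writes)` and the name `sample_trace` are assumptions.

```python
from collections import defaultdict

def sample_trace(trace):
    """Sampling pass sketch: trace is a list of dynamic basic blocks,
    each (pc, n_instrs, read_addrs, write_addrs)."""
    last_seen = {}                  # pc -> position of previous encounter
    last_write = {}                 # address -> position of last definition
    self_len = defaultdict(list)    # per-static-block histograms
    dep_len = defaultdict(list)
    pos = 0                         # instructions executed so far
    for pc, n, reads, writes in trace:
        if pc in last_seen:         # self length: instructions executed
            self_len[pc].append(pos - last_seen[pc])  # since pc last ran
        last_seen[pc] = pos
        # closest flow dependence (RAW) crossing the current position;
        # reads take effect at the beginning of the block
        deps = [pos - last_write[a] for a in reads if a in last_write]
        if deps:
            dep_len[pc].append(min(deps))
        for a in writes:
            last_write[a] = pos + n  # writes take effect at block end
        pos += n
    return self_len, dep_len
```

A block that re-executes after 15 instructions yields a self-length sample of 15 for its static block; a read 20 instructions downstream of its defining write yields a dependence-length sample of 20.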

The analyzer treats each program construct separately. In this particular example, the analyzer does not identify the opportunity of aggressively hoisting and overlapping the execution of the second loop with the initialization loop, as would be done by loop fusion in a compiler.

Figure 11. Example program for task recommendation (hoist.c).

Notice that the algorithm considers and evaluates each task in isolation. In the execution model (Section 4), however, tasks that originate at different taskheads can interfere. Thus, taskheads may not achieve the speedup that the recommendation algorithm computed for them individually. This possible interference of tasks and the averaging over dependence and task lengths are the main factors of inaccuracy in the taskfinder.

Example

Figure 12 shows the output of the taskfinder when applied to the program in Figure 11. The first loop initializes an array; the second loop computes updates with a carried dependence on array variables. The last column in Figure 12 is not actually part of the output but is added to show the actual speedups obtained through emulated execution as described in Section 5.1.

The report in Figure 12 lists two recommendations, one for each loop in the example program. The second line corresponds to the initialization loop (line 7 in Figure 11), which is fully vectorizable (doall). The distance of the closest dependence (column min dep) is reported as zero, meaning there is no dependence. The predicted whole program speedup (column spdup) matches almost exactly the speedup reported by the emulation (column emu spdup). The local speedup (column lspdup) specifies how effective a taskhead would be in speeding up the execution segment it covers (in this case the loop). Since the analysis has been done with the assumption of 8 processors, the local speedup for a doall loop is 8.0. The first line refers to the second loop in the program (line 9 in Figure 11). The code is correctly classified as doacross parallel, since the distance of the closest dependence (30 instructions) is closer than 8 times the average task size (12 instructions). Due to the dependence, the estimated maximum speedup of this loop can only be 3.5. Although the local speedup expected for this loop is less than ideal, the estimated whole program speedup is larger (1.81) than the estimated speedup due to parallelization of the doall loop (1.47). This is because the doacross loop covers a larger fraction of the total execution time (column frac, Amdahl's law).

4. Execution model

Executing an IPOT program requires runtime support for speculative multithreading [27, 9, 25, 28]. The requirements are as follows:

spawn/commit/squash: Support for creating, committing and squashing speculative tasks.

conflict detection: Conflict detection relies on the ordering information and the memory access history of the tasks to determine if there is a violation and which tasks need to be squashed. If a speculative task reads a location that is subsequently written by an 'older' task, a conflict is flagged and the 'younger' task is squashed. This enforcement of dependences guarantees that the original sequential semantics of the program are met.

data versioning: Support for buffering speculative data until commit time. Writes performed speculatively should not be made visible until the task commits. Also related to data versioning is the forwarding of speculative data. Our execution model assumes that speculative versions of data can be provided to more speculative tasks.

The speculative execution model in this paper uses an in-order spawn policy, shown in Figure 13. This policy implies that each speculative task can spawn only a single successor task. Also, if a speculative task commits without spawning any successor, the program terminates. In-order spawn is attractive due to its simplicity of order management, since the order of task creation is the same as the commit order. This directly translates to lower hardware complexity.

In-order spawn can limit performance, since it imposes a limit on how far individual tasks can be hoisted: a task cannot be hoisted earlier than the point where the previous task (following sequential order) starts its execution. Spawn hoisting is the distance between a parent task's commit point and the spawn of its successor task. This distance translates directly into overlap between parent and successor task. A successor task can be spawned as soon as its parent starts executing.

Figure 13. Spawn ordering and hoisting in the IPOT execution model.

4.1 Architectural support

The architectural support required by IPOT is a fairly standard speculative multithreading substrate. The architecture can support conflict detection and data versioning directly in hardware.

Figure 12. Task recommendation for program in Figure 11. The report columns are: address, location, type (doall/doacross), min dep, task size, frac, lspdup, spdup, and emu spdup. The two rows, a doacross recommendation for the loop at hoist.c line 9 and a doall recommendation for the loop at hoist.c line 7, are discussed in the text.

Conflict detection can be implemented by leveraging the coherence protocol [28]. Data versioning can be supported by buffering speculative writes, typically in the L1 or L2 cache [7]. This imposes a limit on the size of speculative tasks that can be handled efficiently (in hardware), which may affect how programmers use the programming model. Since IPOT is mainly targeted at fine-grain parallelism, the performance impact of this limitation is expected to be small.

The main role of the variable annotations presented in Section 2.2 is to convey semantic information about the scope of use (private), mutability properties (final, induction), communication and sharing patterns (flow, reduction), and consistency requirements (race) to the system. The architecture could support this classification by providing means to disable conflict detection and versioning for certain regions of memory. In particular, conflicts are handled differently for race variables: a dependence violation on a race variable causes a squash only when the task that read a stale value also writes to the variable. flow annotations on ordinary variables will benefit from architectural support for synchronization between tasks. This can be implemented using post and wait primitives between the parent and child task.

4.2 Compiler support

Compiler support for IPOT is also fairly standard. Reductions, e.g., can be parallelized with a simple reduction privatization algorithm [20]. For each thread, a private copy of the reduction variable is allocated such that for the majority of the execution, the reduction operation can proceed without synchronization. The compiler is responsible for inserting code that computes a global reduction across the private copies before the value of the reduction variable is used in a non-speculative context. final variables do not require specific support, but their use could be encouraged by the programming language or compiler. private variables can be allocated in registers (no versioning, no conflict detection), in task-private memory, or can be handled automatically by the versioning support of the architecture. Moreover, a compiler is permitted to restructure speculative tasks, e.g., to enlarge a tryasync block or to combine adjacent tryasync blocks. Additional compiler support is possible for both static checking and optimization of annotations. This aspect is highly dependent on the language binding of the IPOT programming model and is outside the scope of this paper.

5. Evaluation

To evaluate the effectiveness of our programming environment without fully implementing the new language extensions in a compiler, the entire code generation for a new architecture, and a simulator that supports all necessary TLS extensions, we decided to implement an emulator, based on binary instrumentation [16], that can estimate the overlap of task execution based on the program annotations described below.

5.1 Methodology

We evaluated IPOT programs using an online dynamic binary instrumentation and tracing tool, based on PIN [16], that emulates the speculative execution model. The program executes sequentially and all memory operations and annotations are traced. Memory operations are collected to build a map of data dependences. By observing the program execution and "marker calls" corresponding to IPOT annotations (e.g., taskhead), the emulator identifies where tasks start and complete in the dynamic instruction stream. Using the dependence map and task information, the emulator computes the overlap of speculative threads by enforcing data dependences and resource constraints, such as the number of processors.

Data dependences are tracked only for heap, not for stack variables. We chose this policy since we realized that the vast majority of stack variables can be classified in one of the categories described in Section 2.2 to support conflict avoidance or resolution. We manually verified this while instrumenting each program. Heap variables that can be handled by one of the techniques for conflict avoidance, e.g., reduction variables, are exempted from dependence checking explicitly through annotations.

Overheads of task creation and dispatch can be specified to the emulator through the spawn cost parameter, which reflects the number of instructions between the spawn point and the actual start of task execution. The spawn cost can impact the effectiveness of the parallelization, especially for very short tasks. For each benchmark we report speedups with an ideal spawn cost of zero instructions and a more realistic cost of 50 instructions.

The emulator collects a multitude of data that characterizes the dynamic behavior of tasks. Figure 14 shows the information collected: # deps represents the number of times a speculative thread is restarted before it is executed to completion; dependence delay is the number of instructions wasted in unsuccessful speculation. We assume the following policy for the placement of a spawn point: a task spawns its successor as soon as it starts execution. In case of a dependence violation, the spawn point of the successor task is delayed to the point where the write that satisfies the dependence occurs in the parent.

hoisting distance is the overlap of a task with its parent task. The longer the hoisting distance, the better the speedup. Intuitively, the capability of a task to achieve a good speedup in combination with other tasks depends on when a task meets incoming and fulfills outgoing dependences: a task should fulfill its shortest outgoing dependence as early as possible to allow consumer tasks downstream in the commit order to overlap with itself; a task should meet its closest incoming dependence as late as possible to foster its own hoisting potential.

Figure 14. Dynamic task information collected by the emulator (spawn cost, # deps, dependence delay, hoisting distance, task length).

5.2 Experimental evaluation

We studied three HPC workloads in detail, namely umt2k [31], namd2 [19], and ammp from the SPEComp [1] benchmark suite (the OpenMP version of the corresponding SPECcpu benchmark). The three programs are designed and structured according to a high-level parallelization following the SPMD/MPI model (umt2k), OpenMP (ammp), and a library with built-in parallelization (namd2).

Our investigation focused on fine-granular and speculative parallelization within one OpenMP thread or MPI task. Hence the parallelization studied with IPOT is complementary to the high-level parallel decomposition of these programs.

We used the task recommendation tool to identify parallelization opportunities in program configurations with a single MPI or OpenMP thread. Following the recommendations, the programs were instrumented manually with directives that specify task boundaries and variable classifications. In the following, we use the term 'task' interchangeably to refer to the specification of a speculative parallelization in the program code and to the dynamic execution instance.

ammp

The recommendation tool pointed out tasks in 5 of 29 source code files (atoms.c, eval.c, rectmm.c, variable.c, vnonbon.c). Most of the annotations were straightforward to apply. The only special case was a task recommended at the beginning of a function, which suggested the task should be placed at the call site. There were a total of 8 task annotations. Two tasks covered about 75% of the overall execution time while other tasks were rarely executed. Task recommendations pointing to loops annotated as OpenMP parallel were ignored.

Table 1 shows a characterization of the tasks annotated in ammp. Columns 1-3 point out the position of a task in the source code and its classification. Columns 4 and 5 show information about the dynamic behavior: average task size and the fraction of the serial program execution covered by each task. We report speedups for program segments in Columns 6-10. A program segment corresponds to a section in the source code. There can be several execution instances of a segment, and different tasks can be invoked in the dynamic scope of a segment. The reported numbers are averages of speedups of all dynamic execution instances of a segment.

Task 1, which has the highest coverage of execution time (63.7%), is the body of a tight loop that searches through a vector. To increase the task granularity, several iterations of this loop are coalesced to form a single task. Tasks 6 and 8 correspond to loop bodies that contain an OpenMP critical section. Since the corresponding OpenMP locks are rarely contended, dependences across iterations are very rare and the tasks are still classified as doall.

Overall, with ideal overheads, the IPOT version achieved a cumulative whole program speedup of 3.14 and 4.78 for 4 and 8 processors, respectively. With more realistic overheads, the speedups were 2.69 and 3.36 for 4 and 8 processors, respectively. Note that this performance comes from parallelism inside OpenMP tasks, so it is additive to the performance gained from OpenMP parallelism.

umt2k

We ran umt2k in a configuration with a single MPI task and a variant of the RFP1 input configuration that reduced the total runtime. The focus of the investigation was on the C kernel where about 80% of the serial execution time is spent in the given configuration. The outer loop of the kernel is parallelized with OpenMP (snflwxyz.c). The parallelization we report concerns only constructs within this loop, i.e., we do not exploit parallelism across iterations of the outer loop.

Task recommendations in this benchmark refer only to opportunities for loop-level parallelization. Table 2 shows that most loops that are identified as optimization targets are without carried dependences (doall). This is a dynamic classification that stems from a runtime assessment about actual dependences across a certain iteration window. Interestingly, in file snswp3d.c, a production parallelizing compiler could identify only one loop, corresponding to task 7, as parallel. Other loops were rejected as possible optimization targets due to unknown aliasing and possible data dependence.

Since most loops in the doall category had a very short body, the tasks were formed by grouping iterations, thus increasing the task size and mitigating the effects of the spawn cost. Most of the doall loops achieve very good speedups on 4 and 8 processors with and without overheads; less than ideal speedups are mostly due to a load imbalance in cases where only a few tens of tasks are created per loop instance. Segment A reports the speedup for the outer loop (one iteration of the OpenMP loop in snflwxyz.c). Segment B is an inner loop that is parallelized with two taskheads. Segments C to L correspond to inner loops that are each parallelized with a single taskhead.

Most of the execution time is spent in one loop (snswp3d.c, segment B) that is parallelized with tasks 1 and 2. This loop has an occasional carried dependence that led to its doacross classification. Unlike in the doall loops, tasks cannot be combined across iterations; otherwise one would lose the opportunity to aggressively overlap the execution of iterations, or parts of iterations, that meet flow dependences or are data independent. The two tasks are placed within the loop body such that likely outgoing dependences are served early and incoming data dependences are met late. Due to the small task size, which is 316 and 294 instructions on average, the effectiveness of the parallelization is more sensitive to the spawn overheads: the speedup on 8 processors is reduced from 7.0 to 5.8. Interestingly, the impact is more pronounced as the number of processors grows: experiments with a 32 processor configuration (not in the table) show that the potential speedup drops from 16.8 to 5.8 due to overheads. This loop also contains a reduction and requires aggressive restructuring and variable privatization to achieve this speedup. We are experimenting with a real simulator to fully validate these results.

namd2

namd2 [19] is a scalable biomolecular simulation. Parallelization and distribution of the computation are handled transparently by the Charm++ [12] library. Our experiments are done with the apoa1 benchmark in cutoff mode in a single processor, single node configuration.

Tasks 1 and 2 constitute the first parallelization opportunity. Both tasks correspond to the same program code, which is inlined in different program contexts. This doall loop is a known optimization target: a pragma annotation in the code specifies the absence of aliasing, allowing compilers to aggressively restructure this loop and enable instruction level parallelism on superscalar architectures. Since the number of iterations is typically low, it is not effective to group iterations and hence tasks are fairly small. The average speedup is less than ideal mostly due to load imbalance.

The second parallelization opportunity identified by the taskfinder is a loop (task 3, ComputeNonbondedInl.h) with a very short body and a carried dependence on a single variable. The loop filters from a list of atoms those that fall under a certain distance threshold (see example in Figure 6). The implementation is coded with 4x manual software pipelining and prefetch. The parallelization with speculative multithreading did not result in a speedup due to the tight carried dependence. Since the tasks are very small, spawn overheads actually lead to a slowdown. As discussed in the example in Section 2.7, the tight dependence is an artifact of the implementation and not required by the application. We did not experiment with an alternative implementation in this evaluation.

Task  Location    Type      Avg. size  Frac. of serial  Segment  Ideal 4p   Ideal 8p   Ovhd 4p    Ovhd 8p
                            [# insts]  execution [%]
1     atoms.c     doall         359        63.7         A        3.5        5.8        3.5        5.2
2     eval.c      doacross     5085         7.4         A, C     3.5, 1.8   5.8, 2.1   3.5, 1.5   5.2, 1.5
3     eval.c      doacross      757         0.5         A, C     3.5, 1.8   5.8, 2.1   3.5, 1.5   5.2, 1.5
4     math.c      doall        2515         0.0         A, C     3.5, 1.8   5.8, 2.1   3.5, 1.5   5.2, 1.5
5     rectmm.c    doall         249        11.4         B        3.0        5.0        1.8        2.0
6     rectmm.c    doall          61         5.1         B        3.0        5.0        1.8        2.0
7     rectmm.c    doall         101         3.1         B        3.0        5.0        1.8        2.0
8     vnonbon.c   doall         118         5.5         C        1.8        2.1        1.5        1.5

Table 1. Dynamic task behavior for ammp.

Task  Location     Type      Avg. size  Frac. of serial  Segment  Ideal 4p   Ideal 8p   Ovhd 4p    Ovhd 8p
                              [# insts]  execution [%]
1     snswp3d.c    doacross      316        31.4         A, B     3.8, 3.9   7.0, 7.0   3.8, 3.8   6.0, 5.8
2     snswp3d.c    doacross      294        29.3         A, B     3.8, 3.9   7.0, 7.0   3.8, 3.8   6.0, 5.8
3     snswp3d.c    doall        7302         7.3         A, C     3.8, 4.0   7.0, 8.0   3.8, 4.0   6.0, 8.0
4     snswp3d.c    doall       19765         4.2         A, D     3.8, 4.0   7.0, 7.7   3.8, 4.0   6.0, 7.7
5     snswp3d.c    doall        9956         0.1         A, E     3.8, 2.4   7.0, 3.7   3.8, 2.4   6.0, 3.6
6     snswp3d.c    doall       29514         1.0         A, F     3.8, 3.7   7.0, 6.3   3.8, 3.7   6.0, 6.3
7     snswp3d.c    doall       24981         5.0         A, G     3.8, 3.9   7.0, 7.7   3.8, 3.9   6.0, 7.7
8     snqq.c       doall       61977         2.0         A, H     3.8, 3.7   7.0, 6.3   3.8, 3.7   6.0, 6.3
9     snqq.c       doall       41320         1.4         A, I     3.8, 3.7   7.0, 6.3   3.8, 3.7   6.0, 6.3
10    ?            doall       32468         1.1         A, J     3.8, 3.7   7.0, 6.3   3.8, 3.7   6.0, 6.3
11    snxyzref.c   doall        5903         0.2         A, K     3.8, 3.7   7.0, 6.7   3.8, 3.7   6.0, 6.7
12    snflwxyz.c   doall       27548         0.9         A, L     3.8, 3.7   7.0, 6.3   3.8, 3.7   6.0, 6.3

Table 2. Dynamic task behavior in umt2k.

Tasks 4 and 5 fall into the third code section that we inspected. Although this code is not very prominent in the serial execution profile (and hence was not identified by the taskfinder), it can become an 'Amdahl bottleneck' when the overall program execution is distributed across hundreds or more processors. We consider two loops in method submitHalfStep, one of which blocks 50 iterations per task. The resulting speedups are significant, with reasonable task sizes in the order of thousands of instructions. This loop-level parallelization is non-trivial due to may-dependences and reductions. The experiments do not indicate actual dependences, hence both loops are classified as doall. Similar opportunities may exist in other methods of the Sequencer class.

Tasks 6 and 7 correspond to an opportunity for method-level parallelization. The execution of method computeForce (implemented in different classes) has tight data dependences and computes several reductions and trigonometric functions. The method itself is not easily parallelizable; the computation is, however, amenable to speculation, i.e., the execution can be hoisted within its calling context (ComputeHomeTuples.h). The speedup is limited due to dependences among different method invocations that occur in a sequence. The speedup with overheads is actually higher than without. This is an artifact of the spawn points inferred by the emulator (Section 5.1) and indicates that a slight delay in a task can help it to meet a flow dependence with its predecessor (speculative data forwarding). Recall that missing a flow dependence results in a spawn delay till the update that serves the dependence. This is an example where misspeculation could be avoided through a flow annotation.

Summarizing, we have identified opportunities for fine-granular thread-level parallelization in three important scientific workloads. Some opportunities require speculative execution, some do not. We reported speedups for the individual program segments that were the target of the parallelization. In combination, the parallelization achieved significant overall speedups, and is complementary to the high-level parallelization already available in these programs.

Task  Location                 Type    Avg. size  Frac. of serial  Segment  Ideal 4p  Ideal 8p  Ovhd 4p  Ovhd 8p
                                       [# insts]  execution [%]
1     ComputeNonbondedBase.h   doall      2271        33.5         A        3.1       4.6       3.0      4.5
2     ComputeNonbondedBase.h   doall      2063        20.4         B        3.6       6.4       3.6      6.2
3     ComputeNonbondedInl.h    doacross     24         0.3         C        1.0       1.0       0.6      0.6
4     Sequencer.C              doall     75302         0.9         D        2.9       3.5       2.9      3.5
5     Sequencer.C              doall     20156         0.6         E        3.5       6.4       3.5      6.4
6     ComputeBonds.C           method     1165         0.6         F        2.3       2.3       3.7      6.2
7     ComputeDihedrals.C       method     3713         2.1         G        1.7       1.7       3.1      4.3

Table 3. Dynamic task behavior in namd2.

6. Related work

Thread-level speculation

Previous research on TLS has focused on architectural support (e.g., [27, 9, 25, 28]) or automatic task decomposition for these architectures (e.g., [11, 15]). Past work typically did not make the step to include the programming model in the picture. With IPOT, the programmer is given a simple and safe abstraction to facilitate fine-grained parallelism, namely tryasync blocks.

Approaches for automatic task selection typically fall into either program structure-based selection or an arbitrary collection of basic blocks. For instance, POSH [15] describes a profile-directed framework for task selection that considers procedures and loops as units for speculation. Min-cut [11] presents a task selection mechanism where task boundaries are placed at points in the program with a minimum number of crossing dependences, not necessarily following boundaries of the program structure. Unlike previous work, the IPOT taskfinder recommends task decomposition solely based on dynamic profiling.

Prabhu and Olukotun [21, 22] also use TLS to simplify manual parallelization. However, the proposal does not include an actual programming model but a set of rules and patterns for manually exploiting speculative parallelism. IPOT is a comprehensive and general programming model, and the taskfinder tool directs the programmer on where to spend effort profitably.

OpenMP

The programming model of IPOT resembles OpenMP, where the parallelization strategy is communicated through program annotations. The main difference between OpenMP and IPOT is that OpenMP requires that data dependence is correctly identified by

the programmer to ensure correct execution. IPOT does not have this requirement.

Implicitly parallel programming languages

Jade [26] is an extension of the C language that facilitates the automatic parallelization of programs according to programmer annotations. As in IPOT, the key idea of Jade is to preserve serial semantics while not constraining the execution to occur serially. A programmer decomposes the program execution into tasks using Jade's withonly-do construct; synchronization occurs at task boundaries. Each task contains an initial access specification (withonly, with) that guides the compiler and runtime system when extracting concurrency. In IPOT, no such access specification is necessary: it is the architecture and runtime system that dynamically infer from the data access stream when tasks may execute in parallel and take corrective action in case of overly aggressive parallelization. In Jade, incorrect access specifications are detected at runtime and flagged with an exception. The situation in IPOT is different: since the execution model of IPOT is based on a transactional harness that enforces serial semantics, most IPOT annotations, if used improperly, can negatively affect execution performance, but not correctness.

Several extensions have been proposed for Java to provide transactional execution semantics. Most relevant and closely related to IPOT are safe futures [32]. Futures have originally been proposed as transparent annotations to enable concurrent expression evaluation. In functional programs that are mostly free of side effects, futures can be used without risk of altering sequential semantics. In Java, futures are provided as a library construct and their correct use is complicated by heap data structures and possible side effects. Safe futures address this problem and provide a safety net that detects and recovers from deviations of the serial execution semantics. The execution of safe futures overlaps with their continuation. IPOT's execution model is different, since the execution of the tryasync block (not its continuation) occurs speculatively and is hoisted in its embedding execution context. The implementation of safe futures is based on a combination of compile time and software runtime mechanisms. The authors used safe futures to parallelize sequential programs, e.g., an object-oriented database kernel and several scientific kernels with loop-level parallelism. The current programming model of safe futures does not define the interaction of safe futures and other threads in explicitly parallel programs.

The programming interface of TCC [8] includes support for commit ordering of transactions. Transactions are grouped into sequences, such that transactions in different sequences commit in any order and transactions within the same sequence commit in the order of their phase. Sequence and phase information are specified explicitly when a transaction is created. In IPOT, the partial commit order is specified implicitly by the nesting structure of async and tryasync program blocks. In TCC, all sections of a program are executed with transactional guarantees, whereas IPOT provides transaction semantics only for tryasync blocks. In subsequent work, the programming language Atomos [3] proposed several language features that support the use of transactional memory. Commit ordering of transactions can be achieved only indirectly through a variant of conditional critical regions (similar to the example in Figure 2).

7. Future work

The taskfinder algorithm presented in Section 3 determines the quality of each task in isolation. This leaves the programmer without information about the possible interactions of tasks, e.g., one task can limit the hoisting of another task in the in-order spawn model (Section 4). We would like to extend the recommendation procedure to take the effect of task combinations into account.

Finally, a more thorough study on the language integration of tryasync is necessary. A mechanism for speculative execution could be used more generally for applications other than ordered speculative multithreading, e.g., for the implementation of speculative program optimization and checked computations where the validity of an operation is judged in hindsight, i.e., after it has left its effects in speculative storage.

8. Concluding remarks

There are two extreme design points in the landscape of parallel programming. On the one end, there is explicitly parallel programming, where thread coordination and concurrency control are a potential source of error and significantly contribute to the complexity and cost of program development. On the other end, there are approaches to fully automated parallelization of sequential codes, with and without speculation support. Research on speculative multithreading has focused at the architectural level and on way-ahead compiler technology. Analysis and code transformations at these levels are frequently not effective in cracking up dense data and control dependences in sequential codes, and hence these fully automated approaches have not been widely deployed in practice.

IPOT is positioned in the middle ground and exposes multithreading to the programmer while at the same time preserving the safe and determinate semantic foundation of a sequential language. IPOT assists the programmer with tools in identifying opportunities for parallelization, and offers features that let the programmer declare the 'intent' of variables and support the runtime in achieving an effective parallelization.

While many challenges remain on the way to an efficient execution platform for IPOT programs, we believe that the simplicity and determinism of the programming model, combined with the attractive execution performance, are worth the additional cost and complexity required for the architectural and runtime implementation.

Acknowledgments

Vijay Saraswat and Josep Torrellas gave very valuable feedback. Sameer Kumar helped to identify and understand parallelization opportunities in namd2, and Xiaotong Zhuang evaluated the auto-parallelization capabilities on parts of umt2k. We thank them all for their kind support. We also thank the reviewers for their thorough reading and insightful comments.

References

[1] V. Aslot, M. Domeika, R. Eigenmann, G. Gaertner, W. B. Jones, and B. Parady. SPEComp: A new benchmark suite for measuring parallel computer performance. Lecture Notes in Computer Science, 2104, 2001.

[2] S. Borkar. MICRO keynote talk, 2004.

[3] B. D. Carlstrom, A. McDonald, H. Chafi, J. Chung, C. C. Minh, C. Kozyrakis, and K. Olukotun. The Atomos transactional programming language. In Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, June 2006.

[4] P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Object-Oriented Programming, Systems, Languages, and Applications, Mar 2005.

[5] P. A. Emrath and D. A. Padua. Automatic detection of nondeterminacy in parallel programs. In Proceedings of the ACM Workshop on Parallel and Distributed Debugging, pages 89-99, Jan. 1989.

[6] C. Flanagan and S. Qadeer. A type and effect system for atomicity. In Proceedings of the 2003 Conference on Programming Language Design and Implementation, pages 338-349, Jun 2003.

[7] S. Gopal, T. Vijaykumar, J. Smith, and G. Sohi. Speculative versioning cache. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture, February 1998.

[8] L. Hammond, B. Carlstrom, V. Wong, B. Hertzberg, M. Chen, C. Kozyrakis, and K. Olukotun. Programming with transactional coherence and consistency (TCC). In Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, Oct 2004.

[9] L. Hammond, M. Willey, and K. Olukotun. Data speculation support for a chip multiprocessor. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

[10] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 289-300, May 1993.

[11] T. A. Johnson, R. Eigenmann, and T. N. Vijaykumar. Min-cut program decomposition for thread-level speculation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Languages Design and Implementation, June 2004.

[12] L. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In Proceedings of the ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'93), pages 91-108, 1993.

[13] A. Kejariwal, X. Tian, W. Li, M. Girkar, S. Kozhukhov, and H. Saito. On the performance potential of different types of speculative thread-level parallelism. In Proceedings of the 20th Annual International Conference on Supercomputing, June 2006.

[14] G. Lee, C. P.

[16] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2005.

[17] J. Manson, W. Pugh, and S. Adve. The Java memory model. In Proceedings of the Symposium on Principles of Programming Languages (POPL'05), pages 378-391, 2005.

[18] R. Netzer and B. Miller. What are race conditions? Some issues and formalizations. ACM Letters on Programming Languages and Systems, 1(1):74-88, Mar. 1992.

[19] J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale. NAMD: Biomolecular simulation on thousands of processors. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1-18, Baltimore, MD, September 2002.

[20] W. M. Pottenger. Induction variable substitution and reduction recognition in the Polaris parallelizing compiler. Master's thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, December 1995.

[21] M. K. Prabhu and K. Olukotun. Using thread-level speculation to simplify manual parallelization. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, June 2003.

[22] M. K. Prabhu and K. Olukotun. Exposing speculative thread parallelism in SPEC2000. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, June 2005.

[23] W. Pugh. The Java memory model is fatally flawed. Concurrency: Practice and Experience, 12(6):445-455, 2000.

[24] J. Rattner. PACT keynote talk, 2005.

[25] J. Renau, J. Tuck, W. Liu, L. Ceze, K. Strauss, and J. Torrellas. Tasking with out-of-order spawn in TLS chip multiprocessors: Microarchitecture and compilation. In Proceedings of the 19th Annual International Conference on Supercomputing, June 2005.

[26] M. Rinard and M. Lam. The design, implementation and evaluation of Jade. ACM Transactions on Programming Languages and Systems, 20(3):483-545, May 1998.

[27] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

[28] J. Steffan, C. Colohan, A. Zhai, and T. Mowry. A scalable approach to thread-level speculation. In Proceedings of the 27th International Symposium on Computer Architecture, Jun 2000.

[29] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. Improving value communication for thread-level speculation. In International Symposium on High-Performance Computer Architecture (HPCA), pages 65-, 2002.

[30] J. G. Steffan and T. C. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. In HPCA, 1998.

[31] The UMT benchmark code. http://www.llnl.gov/asci/purple/benchmarks/limited/umt.

[32] A. Welc, S. Jagannathan, and A. Hosking. Safe futures for Java. In Proceedings of the ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'05), pages 439-453, 2005.
Kruskal, and D. J. Kuck. An empirical study of automatic restructuring of non-numerical programs for parallel processors. IEEE Transactions on Computers, 34(10):927–933, 1985. [15] W. Liu, J. Tuck, L. Ceze, W. Ahn, K. Strauss, J. Renau, and J. Torrellas. Posh: A tls compiler that exploits program structure. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, March 2006.