Implicitly-Threaded Parallelism in Manticore

Implicitly-threaded Parallelism in Manticore Matthew Fluet Mike Rainey John Reppy Adam Shaw Toyota Technological Institute at Chicago University of Chicago fl[email protected] fmrainey,jhr,[email protected] Abstract to medium-scale parallel applications; the rationale for this focus + The increasing availability of commodity multicore processors is has been stated at greater length in prior work [FRR 07]. making parallel computing available to the masses. Traditional par- A homogeneous language design is not likely to take full advan- allel languages are largely intended for large-scale scientific com- tage of the hardware resources available. For example, a language puting and tend not to be well-suited to programming the applica- that provides data parallelism but not explicit concurrency will be tions one typically finds on a desktop system. Thus we need new inconvenient for the development of the networking and GUI com- parallel-language designs that address a broader spectrum of appli- ponents of a program. On the other hand, a language that provides cations. In this paper, we present Manticore, a language for building concurrency but not data parallelism will be ill-suited to the com- parallel applications on commodity multicore hardware including a ponents of a program that demand fine-grained parallelism, such as diverse collection of parallel constructs for different granularities of image processing and particle systems. work. We focus on the implicitly-threaded parallel constructs in our Our belief is that parallel programming languages must provide high-level functional language. We concentrate on those elements mechanisms for multiple levels of parallelism, both because appli- that distinguish our design from related ones, namely, a novel paral- cations exhibit parallelism at multiple levels and because hardware lel binding form, a nondeterministic parallel case form, and excep- requires parallelism at multiple levels to maximize performance. tions in the presence of data parallelism. These features differenti- Indeed, a number of research projects are exploring heterogeneous ate the present work from related work on functional data parallel parallelism in languages that combine support for parallel com- language designs, which has focused largely on parallel problems putation at different levels into a common linguistic and execu- with regular structure and the compiler transformations — most tion framework. The Glasgow Haskell Compiler [GHC] has been notably, flattening — that make such designs feasible. We describe extended with support for three different paradigms for parallel our implementation strategies and present some detailed examples programming: explicit concurrency coordinated with transactional utilizing various mechanisms of our language. memory [PGF96, HMPH05], semi-implicit concurrency based on annotations [THLP98], and data parallelism [CLP+07], inspired Categories and Subject Descriptors D.3.0 [Programming Lan- by NESL [BCH+94, Ble96]. Manticore [FRR+07, FFR+07] in- guages]: General; D.3.2 [Programming Languages]: Language corporates both coarse-grained parallelism and fine-grained nested Classifications—Applicative (functional) languages; Concurrent, parallelism. Manticore’s coarse-grained parallelism is based on distributed, and parallel languages; D.3.3 [Programming Lan- Concurrent ML (CML) [Rep91], which provides explicit concur- guages]: Language Constructs and Features—Concurrent program- rency and synchronous-message passing. Manticore’s fine-grained ming structures nested parallelism is based on previous nested data-parallel languages, such as NESL [BCH+94, Ble96, BG96] and Nepal [CK00, General Terms Design, Languages CKLP01, LCK06]), and provides data parallel arrays and parallel Keywords implicitly-threaded parallelism, data parallelism, par- comprehensions, among other mechanisms. allel binding, parallel case, exceptions After an overview of the Manticore language (Section 2), we present four main technical contributions: 1. Introduction • the pval binding form, for parallel evaluation and speculation The laws of physics and the limitations of instruction-level paral- (Section 3), lelism are forcing microprocessor architects to develop new multi- • the pcase expression form, for nondeterminism and user- core processor designs. As a result, parallel computing is becom- defined parallel control structures (Section 4), ing widely available on commodity hardware. Ideal applications • the inclusion of exceptions and exception handlers in a data for this hardware, such as multimedia processing, computer games, parallel context (Section 5), and and small-scale simulations, can exhibit parallelism at multiple levels with different granularities. Our design is targeted at these small • our implementation strategies for all of the above, as well as our approach to handling exceptions in the context of parallel arrays (Section 7). We exercise our design on a series of examples in Section 6. Permission to make digital or hard copies of all or part of this work for personal or The examples are meant to highlight Manticore’s suitability for classroom use is granted without fee provided that copies are not made or distributed irregular parallel applications, in contrast to the embarrassingly for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute parallel examples that often appear in the data parallelism literature. to lists, requires prior specific permission and/or a fee. We review related work and conclude in Sections 8 and 9. ICFP’08, September 22–24, 2008, Victoria, BC, Canada. Copyright c 2008 ACM 978-1-59593-919-7/08/09. $5.00. 2. An Overview of the Manticore Language CML, Manticore supports threads that are explicitly created using spawn Parallel language mechanisms can be roughly grouped into three the primitive. Threads do not share mutable state; rather categories: they use synchronous message passing over typed channels to com- municate and synchronize. Additionally, we use CML communi- • implicit parallelism, where the compiler and runtime system cation mechanisms to represent the interface to imperative features are responsible for partitioning the computation into paral- such as input/output. lel threads. Examples of this approach include Id [Nik91], Further discussion of explicitly-threaded parallelism in Manti- pH [NA01], and Sisal [GDF+97]. core is beyond the scope of this paper; other recent publications • implicit threading, where the programmer provides annotations, (notably [RX08]) provide more detail. or hints to the compiler, as to which parts of the program are 2.3 Implicitly-threaded Parallelism profitable for parallel evaluation, but the mapping of computation onto parallel threads is left to the compiler and runtime Manticore provides implicitly-threaded parallel versions of a num- system. Examples include NESL [Ble96] and its descendant ber of sequential forms. These constructs can be viewed as hints Nepal [CKLP01], later renamed Data Parallel Haskell (“DPH” to the compiler about which computations are good candidates for below). parallel execution; the semantics of many of these constructs is sequential and the compiler and/or runtime system may choose to • explicit threading, where the programmer explicitly creates execute them in a single thread. parallel threads. Examples include CML [Rep99] and Er- Having a sequential semantics is useful in two ways: it gives the lang [AVWW96]. programmer a deterministic programming model and it formalizes These different design points represent different trade-offs between the expected behavior of the compiler. It also requires the compiler programmer effort and programmer control. Automatic techniques to verify that the individual subcomputations in a parallel compu- for parallelization have proven effective for dense regular parallel tation do not send or receive messages (if the computations are ac- computations (e.g., dense matrix algorithms), but have been less tually to be executed in parallel). Similarly, if a subcomputation successful for irregular problems. Manticore provides both implicit raises an exception, the implementation must delay the delivery threading and explicit threading mechanisms. The former supports of the exception until all sequentially prior computations have ter- fine-grained parallel computation, while the latter supports coarse- minated. Both of these restrictions can be efficiently implemented grained parallel tasks and explicit concurrent programming. These only through appropriate program analyses. parallelism mechanisms are built on top of a sequential functional language. In the sequel, we discuss each of these in turn, starting Parallel Arrays Support for parallel computations on arrays and with the sequential base language. Space precludes giving a com- matrices is common in parallel languages. In Manticore, we sup- plete account of Manticore’s language-design philosophy, goals, port such computations using the nested parallel array mechanism and target domain; we refer the interested reader to our previous inspired by NESL and developed further by Nepal and DPH. A publications [FRR+07, FFR+07]. parallel array expression has the form [|e1, :::, en|] 2.1 Sequential Programming which constructs an array of n elements. The delimiters [| |] alert Manticore’s sequential core is based on a subset of Standard ML the compiler that the ei may be evaluated in parallel. (SML).

Implicitly-Threaded Parallelism in Manticore

Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++

Parallel Computing a Key to Performance

Control Replication: Compiling Implicit Parallelism to Efficient SPMD with Logical Regions

Regent: a High-Productivity Programming Language for Implicit Parallelism with Logical Regions

Implicit Parallelism, Mixed Approaches

Lawrence Berkeley National Laboratory Recent Work

14. Parallel Computing 14.1 Introduction 14.2 Independent

Implicitly-Multithreaded Processors

Parallel Programming – Multicore Systems

MT 2015 21011433 8070.Pdf

Global HPCC Benchmarks in Chapel: STREAM Triad, Random Access, and FFT (Revision 1.4 — June 2007 Release)

SCS5108 Parallel Systems Unit II Parallel Programming