
Lazy Threads: Compiler and Runtime Structures for

Fine-Grained Parallel Programming

by

Seth Copen Goldstein

B.S.E. (Princeton University)

M.S.E. (University of California, Berkeley)

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

GRADUATE DIVISION

of the

UNIVERSITY of CALIFORNIA at BERKELEY

Committee in charge:

Professor David E. Culler, Chair

Professor Susan L. Graham

Professor Paul McEuen

Professor Katherine A. Yelick

Fall

Abstract

Lazy Threads: Compiler and Runtime Structures for Fine-Grained Parallel Programming

by

Seth Copen Goldstein

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor David E. Culler, Chair

Many modern parallel languages support dynamic creation of threads or require multithreading in their implementations. The threads describe the logical parallelism in the program. For ease of expression and better resource utilization, the logical parallelism in a program often exceeds the physical parallelism of the machine and leads to applications with many fine-grained threads. In practice, however, most logical threads need not be independent threads. Instead, they could be run as sequential calls, which are inherently cheaper than independent threads. The challenge is that one cannot generally predict which logical threads can be implemented as sequential calls. In lazy multithreading systems, each logical thread begins execution sequentially, with the attendant efficient stack management and direct transfer of control and data. Only if a thread truly must execute in parallel does it get its own thread of control.

This dissertation presents new implementation techniques for lazy multithreading systems on conventional machines and compares a variety of design choices. We develop an abstract machine that makes explicit the design decisions for achieving lazy multithreading. We introduce new options on each of the four axes in the design space: the storage model, the thread representation, the disconnection method, and the queueing mechanism. Stacklets, our new storage model, allow parallel calls to maintain the invariants of sequential calls. Thread seeds, our new thread representation, allow threads to be stolen without requiring thread migration or shared memory. Lazy-disconnect, our new disconnection method, does not restrict the use of pointers. Implicit and lazy queueing, our two new queueing mechanisms, eliminate the need for explicit bookkeeping. Additionally, we develop a core set of compilation techniques and runtime primitives that form the basis for the efficient implementation of any design point.

We have evaluated the different approaches by incorporating them into a compiler for an explicitly parallel extension of Split-C. We show that there exist points in the design space (e.g., stacklets, thread seeds, lazy-disconnect, and lazy queueing) for which fine-grained parallelism can be efficiently supported even on distributed-memory machines, thus allowing programmers the freedom to specify the parallelism in a program without concern for excessive overhead.


Contents

List of Figures

List of Tables

Introduction
    Motivation
    The Goal
    Contributions
    Road Map

Multithreaded Systems
    The Multithreaded Model
    Using Fork for Thread Creation
    Defining the Potentially Parallel Call
    Problem Statement

Multithreaded Abstract Machine
    The Multithreaded Abstract Machine (MAM)
        Threads
        Processors and MAM
        Thread Operations
        Inlets and Inlet Operations
        Thread Scheduling
        Discussion
    Sequential Call and Return
    MAM/DF: Supporting Direct Fork
        The MAM/DF Scheduler
        Thread Operations in MAM/DF
        Continuation Stealing
        Thread Seeds and Seed Activation
        Closures
        Discussion
    MAM/DS: Supporting Direct Return
        MAM/DS Operations
        Discussion
    The Lazy Multithreaded Abstract Machine (LMAM)
    Disconnection and the Storage Model
        Eager-disconnect
        Lazy-disconnect
    Summary

Storage Models
    Storing a Thread's Internal State
    Remote Fork
    Linked Frames
        Operations on Linked Frames
        Linked Frame Timings
    Multiple Stacks
        Stack Layout and Stubs
        Stack Implementation
        Operations on Multiple Stacks
        Multiple Stacks Timings
    Spaghetti Stacks
        Spaghetti Stack Layout
        Spaghetti Stack Operations
        Spaghetti Stack Timings
    Stacklets
        Stacklet Layout
        Stacklet Operations
        Stacklet Stubs
        Compilation
        Timings
        Discussion
    Mixed Storage Model
    The Memory System
    Summary

Implementing Control
    Representing Threads
        Sequential Call
        The Parallel-Ready Sequential Call
        Seeds
        Closures
        Summary
    The Ready and Parent Queues
        The Ready Queue
        The Parent Queue
        The Implicit Queue and Implicit Seeds
        The Explicit Queue and Explicit Seeds
        Interaction Between the Queue and Suspension
    Disconnection and Parent-Controlled Return Continuations
    Compilation Strategy
        Lazy-disconnect
        Eager-disconnect
        Synchronizers
        Lazy Parent Queue
    Costs in the Seed Model
        The Implicit Parent Queue
        The Explicit Parent Queue
        The Lazy Parent Queue
    Continuation Stealing
        Control Disconnection
        Migration
    Integrating the Control and Storage Models
        Linked Frames
        Stacks and Stacklets
        Spaghetti Stacks
    Discussion

Programming Languages
    Split-C+threads
        Example Split-C+threads Program
        Thread Extensions to Split-C
        Forkset Statement
        Pcall Statement
        Fork Statement
        Start Statement
        Suspend Statement
        Yield Statement
        Function Types
        Manipulating Threads
        Split-Phase Memory Operations and Threads
        Other Examples
        Lazy Thread Style
        I-Structures and Strands
        Split-C+threads Compiler
    Id

Empirical Results
    Experimental Setup
    Compiler Integration
        Eager Threading
        Comparison to Sequential Code
        Register Windows vs. Flat Register Model
    Overall Performance Comparisons
        Comparing Memory Models
        Thread Representations
        Disconnection
        Queueing
    Running on the NOW
        Using Copy-on-Suspend
    Suspend and Steal Stream Entry Points
    Comparing Code Size
    Id Comparison
    Summary

Related Work

Conclusions

Appendix A: Glossary

Bibliography


List of Figures

Examples of potentially parallel calls.
One example implementation of a logically unbounded stack assigned to each thread. The entire structure is called a cactus stack.
Logical task graphs.
The syntax of the formal descriptions.
Formal definition of a thread as a tuple.
Formal definition of a processor as a tuple.
Formal definition of the MAM as a tuple.
Example translation of a function into pseudocode for the MAM.
The thread exit instruction.
The yield instruction.
The suspend operation.
The fork operation for MAM.
The send instruction.
The ireturn instruction.
The legal state transitions for a thread in MAM.
The enable instruction.
Describes how an idle processor gets work from the ready queue.
Redefinition of the machine and thread for MAM/DF.
The legal state transitions for a thread in the MAM/DF without work stealing.
The fork operation under MAM/DF.
The dfork operation under MAM/DF transfers control directly to the forked child.
The dexit operation under MAM/DF transfers control directly to the parent.
The suspend operation under MAM/DF for a lazy thread.
The new state transitions for a thread in MAM/DF using continuation stealing.
Formal semantics for stealing work using continuation stealing.
Three possible resumption points after executing dfork.
Example thread seed and seed code fragments associated with dforks.
Work stealing through seed activation.
Example translation of a function into pseudocode for the MAM/DF using thread seeds.
Example of seed activation.
The seed return instruction is used to complete a seed routine.
Example translation of a function into pseudocode for the MAM/DS using thread seeds.
Redefinition of a thread for MAM/DS.
A pseudocode sequence for an lfork with its two associated return paths.
The lfork operation in MAM/DS.
The lreturn operation in MAM/DS when the parent and child are connected and the parent has not been resumed since it lforked the child.
The suspend operation in MAM/DS.
The lreturn operation when the parent and child have been disconnected by a continuation-stealing operation.
The lreturn operation in MAM/DS when a thread seed in the parent has been activated.
The lfork operation in LMAM with multiple stacks.
Eager-disconnect used to disconnect a child from its parent when a linked storage model is used.
An example of eager-disconnect when the child is copied to a new stack.
An example of eager-disconnect when the parent portion of the stack is copied to a new stack.
The suspend operation under LMAM/MS and LMAM/S using child copy for eager-disconnect.
An example of how a parent and its child are disconnected by linking new frames to the parent.
The suspend operation for LMAM/MS and LMAM/S using lazy-disconnect.
The lfork operation under LMAM/MS using lazy-disconnect after a thread has been disconnected from a child.
A cactus stack using linked frames.
A cactus stack using multiple stacks.
Layout of a stack with its stub.
Allocating a sequential call or an lfork on a stack.
Allocating a new stack when a fork is performed.
Disconnection using child-copy for a suspending child.
Disconnection caused by seed activation.
Disconnection caused by continuation stealing.
Two example cactus stacks using spaghetti stacks.
Allocation of a new frame on a spaghetti stack.
Deallocating a child that has run to completion.
Deallocating a child that has suspended but is currently at the top of the spaghetti stack.
Deallocating a child that has suspended and is in the middle of the stack.
A cactus stack using stacklets.
The basic form of a stacklet.
The result of a sequential call which does not overflow the stacklet.
The result of a fork, or of a sequential call which overflows the stacklet.
A remote fork leaves the current stacklet unchanged and allocates a new stacklet on another processor.
Pseudocode for a parallel-ready sequential call and its associated return entry points.
Example of a closure and the code to handle it.
An example of a cactus stack split across two processors with the implicit parent queue embedded in it.
Example of when a fork generates a thread seed and when it does not.
Example of how an explicit seed is stored on the parent queue.
A comparison of the different implementations of suspend depending upon the type of parent queue.
Example implementation of two lforks followed by a join.
Each forkset is compiled into three code streams.
An example of applying the synchronizer transformation to two forks followed by a join.
The table on the left shows how a seed is represented before and after conversion. The routines that aid in conversion are shown on the right.
Example implementation of a parallel-ready sequential call with its associated inlets in the continuation-stealing control model.
A parallel-ready sequential call with its associated steal and suspend routines, helper inlets, and routines for handling thread migration.
A local-thread stub that can be used for either stacks or stacklets.
Code fragments used to implement a minimal-overhead spaghetti stack.
An implementation of the Fibonacci function in Split-C+threads.
The new keywords defined in Split-C+threads.
New syntax for the threaded extensions in Split-C+threads.
Example uses of forkset.
Implementation of I-structures in Split-C+threads using suspend.
The flow graph for the sequential and suspend code streams for a forkable function compiled with an explicit parent queue, the stacklet storage model, lazy-disconnect, and thread seeds.
The grain microbenchmark.
Improvement shown by lazy multithreading relative to eager multithreading.
Memory required for eager and lazy multithreading.
Slowdown of uniprocessor multithreaded code running serially over GCC code.
Average slowdown of multithreaded code running sequentially on one processor over GCC code.
Comparison of the three memory models.
Comparison of continuation stealing and seed activation when all threads run to completion.
The execution-time ratio between the continuation-stealing model and the thread-seed model when the first pcall of the forkset suspends.
Comparing the effect of which of the ten pcalls in a forkset suspends.
Comparing thread representations as the percentage of leaves that suspend increases.
Comparison of eager and lazy disconnection when all the threads run to completion.
Comparison of eager and lazy disconnection as the number of leaves suspending changes.
Comparison of eager and lazy disconnection when the first pcall in a forkset with two or ten pcalls suspends.
Comparison of the three queueing methods (explicit, implicit, and lazy) for grain when all threads run to completion.
Comparison of the queueing mechanisms as the number of threads suspending increases.
Speedup of grain on the NOW relative to a version compiled with GCC on a single processor.
A breakdown of the overhead when running on one processor of the NOW.
Speedup of lazy threads (stacklets, thread seeds, lazy disconnection, and explicit queue) on the CM-5 compared to the sequential C implementation, as a function of granularity.
The effect of reducing the polling frequency.
Speedup of grain for the explicit and lazy queue models.
Speedup of an unbalanced grain where there is more work in the second pcall of the forkset.
Speedup of an unbalanced grain where there is more work in the first pcall of the forkset.
Speedup and efficiency achieved with and without copy-on-suspend.
The amount of code dilation for each point in the design space.
Which points in the design space are effective, and why.


List of Tables

Comparison of features supported by different programming systems.
Times for the primitive operations using linked frames.
Times for the primitive operations using multiple stacks.
Times for the primitive operations using spaghetti stacks.
Times for the primitive operations using stacklets.
Times for the primitive operations using a sequential stack for purely sequential calls; stacklets for potentially parallel and suspendable sequential calls with moderate-sized frames; and the heap for the rest.
The cost of the basic operations using an implicit parent queue.
The cost of the basic operations using an explicit parent queue.
The cost of the basic operations using a lazy parent queue.
Comparison of basic thread operations using library packages versus compiler integration.
Comparison of grain compiled by GCC for register windows and for a flat register model.
The minimum slowdowns relative to GCC-compiled executables for the different memory models, and where in the design space they occur, for a given grain size (in instructions).
Dynamic runtime, in seconds, on a SparcStation for the Id benchmark programs under the TAM model and lazy threads with multiple strands.

Chapter 1

Introduction

Motivation

As parallel computing enters its adolescence, its success depends in large part on the performance of parallel programs in the non-scientific domain. Typically, these programs are less structured in their control flow, resulting in irregular parallelism and an increased need to support multiple dynamically created threads of control. Hence, the success of parallel computing hinges on the efficient implementation of multithreading. This dissertation shows how to efficiently implement multithreading on standard architectures using novel compilation techniques, making it possible to tackle a broad class of problems in modern parallel computing. The goal of this thesis is to present and analyze an efficient multithreading system.

Many modern parallel languages provide methods for dynamically creating multiple independent threads of control, such as forks, parallel calls, futures, object methods, and non-strict evaluation of argument expressions. These threads describe the logical parallelism in the program, i.e., the parallelism that would be present if there were an infinite number of processors to execute the threads. The language implementation maps the dynamic collection of threads onto the set of physical processors executing the program, either by providing its own language-specific scheduling mechanisms or by using a general threads package. These languages stand in contrast to languages with a single logical thread of control, such as High Performance Fortran, or a fixed set of threads, such as Split-C and MPI, which are typically targeted to scientific applications.


There are many reasons to have the logical parallelism of the program exceed the physical parallelism of the machine, including ease of expression and better resource utilization in the presence of synchronization delays, load imbalance, and long communication latency. When threads are available to the programmer, they can be used to cleanly express multiple goals, coroutines, and parallelism within a single program. Threads can be used by the programmer or compiler to hide the communication latency of remote memory references. More importantly, they can be used to hide the potentially long latency involved in accessing synchronizing data structures or other synchronization constructs. Load balance can also be improved when many threads are distributed among fewer processors on a parallel system. Finally, the semantics of the language or the synchronization primitives may allow dependencies to be expressed in such a way that progress can be made only by interleaving multiple threads, effectively running them in parallel even on a single processor.

To support unbounded logical parallelism, regardless of the language model, the underlying execution model must have three key features. First, the control model must allow the dynamic creation of multiple threads with independent lifetimes. Second, the model must allow switching between these threads in response to synchronization events and long-latency operations. Third, since each thread may require an unbounded stack, the storage model is a tree of stacks called a cactus stack. Unfortunately, a parallel call or thread fork is fundamentally more expensive than a sequential call because of the storage management, data transfer, control transfer, scheduling, and synchronization involved. Much previous work has sought to reduce this cost by using a combination of compiler techniques and clever runtime representations, or by supporting fine-grained parallel execution directly in hardware. These approaches, among others, have been used in implementing parallel programming languages such as ABCL, CC++, Charm, Cid, Cilk, Concert, Id, Mul-T, and Olden. In some cases, the cost of the fork is reduced by severely restricting what can be done in a thread. Lazy Task Creation, implemented in Mul-T, is the most successful in reducing the cost of a fork. However, in all of these approaches, a fork remains substantially more expensive than a simple sequential call.


The Goal

Our goal is to support an unrestricted parallel thread model and yet reduce the cost of thread creation and termination to little more than the cost of a sequential call and return. We also want to reduce the cost of switching and scheduling threads so that the fine-grained parallelism needed to implement unstructured parallel programs will not adversely affect the overall performance of these programs.

We observe that fine-grained parallel languages promote the use of small threads that are often short-lived, on the order of a single function call. However, the advantages of fine-grained parallelism can be overwhelmed by the overhead of creating a new thread, which is inherently a more complex and expensive operation than a sequential call. Fortunately, logically parallel calls can often be run as sequential calls. For example, once all the processors are busy, there may be no need to spawn additional work, and in the vast majority of cases the logic of the program permits the child to run to completion while the parent is suspended. Thus, a parallel call should be viewed as a potentially parallel call, that is, a point in the computation where an independent thread may, but will not necessarily, be created. The problem is that, in general, when the thread is created there is no way to know whether it will complete without suspending or whether it will need to run in parallel with its parent.

In Figure 1.1 we show three examples of forks that are potentially parallel calls. In Figure 1.1(a), fork fib(n-1) is a potentially parallel call because, if all the processors are busy, fib(n-1) can run to completion while the invoker is suspended. In Figure 1.1(b), the thread created by fork consumer(queue) can run to completion if the shared queue q is never empty and getItem(q) never suspends. However, it is a potentially parallel call because, if the queue becomes empty, the thread running consumer will suspend, and it needs to run in parallel with a thread that fills up the shared queue. The last fork, of task in Figure 1.1(c), is a potentially parallel call because if iptr points to another processor, then the get operation x = *iptr can suspend, causing the thread running task to suspend. If iptr points to a location on the same processor as is running the thread, then the operation will complete immediately; in other words, it can run to completion in serial with its parent.

We implement a potentially parallel call almost exactly like a stack-based sequential call to exploit the fact that most such calls complete without suspending. This differs from a typical implementation, which assumes all parallel calls will run in parallel.


int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    forkset {
        /* control leaves forkset when all forks in forkset have completed */
        x = fork fib(n-1);
        y = fork fib(n-2);
    }
    return x+y;
}

(a) fork fib(n-1)

void consumer(SharedQueue q)             void task(int *global iptr)
{                                        {
    int item;                                int x;
    while ((item = getItem(q)) != END)       x = *iptr;
        ...                                  ...
}                                        }

(b) fork consumer(queue)                 (c) fork task(int *global i)

Figure 1.1: Examples of potentially parallel calls.

The typical implementation allocates a child frame on the heap, stores the arguments into the frame, and schedules the child. Later, the child is run, stores its results in its parent's frame, schedules the parent, and returns the child frame to the heap store. We, on the other hand, implement the potentially parallel call as a parallel-ready sequential call. We call our overall approach lazy threads. The call allocates the child frame on the stack of the parent, like a sequential call. As with a sequential call, control is transferred directly to the child (suspending the parent), arguments are transferred in registers, and, if the child returns without suspending, results are transferred in registers when control is returned to the parent. Moreover, even if the child suspends, the parent is resumed and continues to execute without copying the stack. Instead, the suspending child assumes control of the parent's stack, and further children called by the parent are executed on stacks of their own. In other words, the suspending child steals the thread out from under its parent. If work is needed by another processor, the parent can be resumed to spawn children on that processor.


The key to an effective implementation of lazy threads is to invoke the child (and suspend the parent) without performing any bookkeeping operations above those required by a sequential call, and yet allow the parent to continue before its lazy child completes. In order to reach this goal, we need to pay attention to four components of the invocation and suspension processes:

1. How the thread state is stored. We investigate four storage models: linked frames, multiple stacks, spaghetti stacks, and stacklets.

2. How work in the parent is represented. We study three different representations for work in the parent: thread seeds, closures, and continuations.

3. How work is enqueued so it can later be found. We investigate three queueing methods: implicit queues, explicit queues, and lazy queues.

4. How the child and parent are disconnected so that the parent may continue before the child completes. We study two disconnection methods: eager-disconnect and lazy-disconnect.

The advantage of lazy threads is that when they run sequentially, i.e., when they run to completion without suspending, they have nearly the same efficiency as a sequential call. Yet the cost of elevating a lazy thread into an independent thread of control is close to that of executing it in parallel outright.

Unlike previous approaches to lazy parallelism (e.g., load-based inlining and Lazy Task Creation), our code-generation strategy avoids bookkeeping operations like creating task descriptors, initializing synchronization variables, or even explicitly enqueueing tasks. Furthermore, our approach does not limit the class of legal programs, as opposed to some previous approaches. Through careful attention to the program representation, we pay almost nothing for the ability to elevate a sequential call into a full thread on demand. This contrasts with previous approaches to lazy parallelism based on continuation stealing, in which the parent continues in its own thread, forcing stack copies and migration.


Contributions

The main contribution of this thesis is the development of a set of primitives that can be used to compile fine-grained parallel programs efficiently.

- We introduce new memory management primitives, called stacklets and stubs, to more efficiently manage the cactus stack.

- We introduce new control primitives, called thread seeds and parent-controlled return continuations, to reduce the cost of scheduling and synchronizing threads.

- We analyze, for the first time, the four-dimensional design space for multithreaded implementations: the storage model (linked frames, multiple stacks, spaghetti stacks, and stacklets), the thread representation (continuations, thread seeds, and closures), the queue mechanism (implicit, explicit, or lazy), and the disconnection method (eager or lazy). We show how each point in the space can be implemented using the same set of primitives.

- Using the techniques presented in this thesis, we reimplement several earlier approaches, allowing us to compare, for the first time, the different points in the design space in the same environment.

- We develop a new parallel language, Split-C+threads, which is one of the source languages for our experiments.

- We develop a compiler for Split-C+threads, a multithreaded extension of Split-C, by implementing our compilation framework in GCC.

- We implement a new back end for the Id compiler, a second source language for our experiments, which produces significantly faster executables than previous approaches.

Road Map

Chapter 2 defines the type of multithreaded systems on which this thesis focuses. Chapter 3 presents an abstract multithreaded machine which embodies the principal concepts behind these systems. We use the abstract machine to explore the four axes of the design space: storage models, disconnection methods, thread representations, and queueing methods. In Chapter 4, we describe the implementations of the different possible storage models and how they affect the overall efficiency of multithreading. In Chapter 5, we describe the mechanisms used by the different control models and show how they interact with the storage models. Chapter 6 describes the two source languages, Split-C+threads and Id, used in our experiments, and the techniques used in their compilation. Chapter 7 presents experimental results which compare the different multithreaded implementations using our compilation techniques and the mechanisms introduced in Chapters 4 and 5. Chapter 8 discusses related work. Finally, Chapter 9 contains a summary and concluding remarks. A glossary of special terminology used in this dissertation can be found in Appendix A.

Chapter 2

Multithreaded Systems

In this chapter we define the scope of the multithreaded systems that we study in this dissertation. We begin by defining the attributes of a multithreaded system. We then describe the language constructs for thread creation and define the potentially parallel calls which can be optimized by our system.

The Multithreaded Model

A multithreaded system is one that supports multiple threads of control, where each thread is a single sequential flow of control. We classify multithreaded systems into three categories: limited multithreading systems, virtual software processor systems, and virtual hardware processor systems. Limited multithreading systems are those that support multiple threads of control, but in which each thread must run to completion or does not have an unbounded stack. This thesis concerns itself primarily with virtualizing a software processor, i.e., a multithreaded system in which each thread is a virtual processor with its own locus of control and unbounded stack, but which does not support preemption. A thread in a virtual hardware processor system is a virtual processor in every sense, i.e., each thread has its own locus of control and an unbounded stack, and the system supports preemption.

There are five features that such a system must support to virtualize a hardware processor:

1. Creation. It must be able to create new threads dynamically during the course of the program. This is in direct contrast to the single-program multiple-data (SPMD)


System           Dynamic creation   Lifetime      Suspension   Stack       Preemption
Split-C          no                 fixed         no           Unbounded   no
Cilk             Yes                parent        no           N/A         no
Filaments        Yes                parent        no           N/A         no
CThreads         Yes                independent   yes          fixed       no
NewThreads       Yes                independent   yes          fixed       no
Chorus           Yes                independent   yes          fixed       no
Solaris Threads  Yes                independent   yes          fixed       yes
Nexus            Yes                independent   yes          fixed       yes
Lazy Threads     Yes                independent   yes          unbounded   no

Table 2.1: Comparison of features supported by different programming systems. N/A indicates a feature which is not applicable; in particular, in systems which do not support suspension, threads always run to completion, so each thread does not need its own unbounded stack and all threads can share a single stack.

        D
        |
        C     F   G
        |      \ /
        B       E
         \     /
          \   /
            A

Figure 2.1: One example implementation of a logically unbounded stack assigned to each thread. The entire structure is called a cactus stack.

programming model, where the number of threads is fixed throughout the life of a program (e.g., in Split-C and Fortran-MPI).

2. Lifetime. Each thread may have an independent lifetime. If the lifetime of a thread is limited to the scope of its parent, or is fixed for the duration of the program, then there are types of programs (described later in this chapter) which cannot be represented.

3. Stack. Each thread has a logically unbounded stack. Many multithreaded systems limit the size of a thread's stack (Chorus, QuickThreads, etc.), which in turn restricts the kinds of operations that a thread may perform. In contrast, we provide a logically unbounded stack. This does not require that the stack be implemented in


contiguous memory. Figure 2.1 shows a logically unbounded stack implemented by linking heap-allocated frames for each function to the frame of the invoking function. Empirically, this structure is tall and thin, so we call it a cactus stack.

4. Suspension. Threads may suspend and later be resumed. This implies that they can interact with and depend on each other. Systems that do not support suspension (e.g., Cilk and Filaments) severely restrict the set of programs that can be expressed. For instance, these systems cannot use threads in producer-consumer relationships.

5. Preemption. A thread may be preempted by another thread. Some multithreaded systems also allow preemption of threads, which allows fair scheduling of threads. Without preemption, threads may never get a chance to execute, and without some thought, programs can become livelocked. While preemption is essential for operating system tasks, it is less important for application programs.

If a system supports preemption, then a thread is not guaranteed to run indefinitely. In fact, the operating system may at any time remove a thread from its processor and start another thread. This is possibly the most controversial feature of multithreaded systems. While required in systems where threads are competing for resources, it may be a burden in systems that support multithreading among cooperating threads. In the former case, if threads are non-preemptive, then one thread could block competing threads from ever executing. In the latter case, however, preemption is not needed to guarantee that a thread will run. For example, a thread that is computing a result needed by other threads will always get to run, because the threads needing the result will eventually block. In addition, if preemption exists, then explicit action is required to ensure atomicity of operations. In the absence of preemption, a thread makes all decisions as to when it can be interrupted, so atomicity is easier to guarantee. However, care must be taken to avoid constructs like spin-locks.

In the multithreaded systems we consider, logical parallelism is independent of physical parallelism. A multithreaded system does not specify whether the individual threads are run on many hardware processors, on a single hardware processor by time-slicing, or on a hardware-multithreaded machine.

(On a distributed memory machine, where local memory is owned by a processor and one processor cannot access another processor's local memory, atomicity is easily provided by ignoring the network. On shared memory machines this is harder, but for nonpreemptive systems compilation techniques can provide atomicity when required.)

In short, a multithreaded system supports the dynamic creation of threads, each of which is suspendable, has an independent lifetime, and has a possibly unbounded stack. In addition, the system may allow threads to be preempted. In this thesis we are concerned with virtualizing the processor for a single cooperating task, and thus we concentrate on multithreaded systems that do not support preemption. Our multithreaded system is aimed at application programs, not operating system tasks.

Using Fork for Thread Creation

Many constructs have been developed to create and manage multiple threads of control, such as forks, parallel calls, futures, active objects, etc. These constructs may be exposed to the programmer in library calls (as in PThreads) or as part of the language (as in Mul-T and Cilk), or they may be hidden from the programmer but used by the compiler to implement the language (as in Id).

Fork is the fundamental primitive for creating a new thread. When executed, it creates a new thread, which we call the child thread, that runs concurrently with the thread that executed the fork, which we call the parent thread. While there are many methods to synchronize these threads, the most common is a join. The join acts as a barrier for a group of threads: when they have all executed the join, they join to become a single thread. All but one of the participating threads cease to exist after control has passed the point of the join. We call the threads that are synchronized by a single join a forkset.

All of the other methods for creating threads can be implemented by means of fork. Henceforth we use fork and parallel call interchangeably, and only they are mentioned explicitly.
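For comparison, the same fork/join pattern written against an eager thread library (POSIX threads here) looks like the sketch below; every fork pays for a fully independent thread, which is the cost lazy multithreading seeks to avoid whenever the child could have run as a sequential call. The function names and workload are illustrative only.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for the child's real work: double the argument in place. */
static void *child(void *arg) {
    int *p = arg;
    *p *= 2;
    return NULL;
}

/* Fork two children and join them; the two children form one forkset. */
int fork_and_join(int x, int y) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, child, &x);   /* fork */
    pthread_create(&t2, NULL, child, &y);   /* fork */
    pthread_join(t1, NULL);                 /* join: barrier for the forkset */
    pthread_join(t2, NULL);
    return x + y;                           /* both children have terminated */
}
```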

Defining the Potentially Parallel Call

In this section we define the potentially parallel call by looking at the three kinds of threads that can be created in a multithreaded system: downward, upward, and daemon threads. This thesis concerns itself mainly with optimizing the potentially parallel call that

[Figure: three logical task graphs with numbered thread nodes 1 through 6. (a) Only downward threads. (b) Downward threads and one upward thread, thread 3. (c) Downward threads and one daemon thread, thread 5.]

Figure: Logical task graphs. Each node is a thread; each node has an edge to its children and to the youngest ancestor alive at the time of its death. Thread 3 is an upward thread, thread 5 is a daemon thread, and a node with no incoming edge is a root thread.

creates a downward thread. We also show how threads that ultimately require concurrency, including upward and daemon threads, can be implemented efficiently.

The logical task graph represents the interaction among all of the threads created during the life of a multithreaded program. Each node in the graph represents a thread. The directed edges represent the transfer of control between threads (see Figure). Each node has an edge to each of its children and to the youngest ancestor that is alive at its death. If no outgoing edge exists for a node, then the thread it represents was still executing at the end of the program. If no incoming edge ends at a node, then it is a root thread, i.e., one of the threads created at program startup. There can be more than one root thread.

By distinguishing three kinds of threads (downward, upward, and daemon) we are able to precisely define the potentially parallel call. A downward thread is one that terminates before its parent terminates; its ancestor edge points to its parent. All the threads in Figure (a) are downward threads. An upward thread is one which terminates after its parent terminates. For example, thread 3 in Figure (b) is an upward thread, indicated by the upward edge going from thread 3 to an ancestor other than its parent. A daemon thread is one which is not a root thread but is still alive at the end of the program, i.e., it has no outgoing edges. For example, in Figure (c), thread 5 is a daemon thread.

When there are neither upward threads nor daemon threads created during an execution, the logical task graph equals its transpose, i.e., every edge participates in a simple cycle of length two. In this case the only synchronization primitive needed for control is the join primitive. If a downward thread does not suspend, then the fork that created it could have been replaced with a sequential call. In other words, downward threads are those created by potentially parallel calls.

(The name "upward" comes about since the characteristics of upward threads mirror those of upward funargs. The name "daemon" is borrowed from Java.)

On the other hand, daemon and upward threads require that independent threads be created, even if none of them suspend. The creation of these threads always incurs the cost of a fork. Daemon and upward threads are not created with potentially parallel calls, but with parallel calls.

Problem Statement

This thesis presents a design space for implementing multithreaded systems without preemption. We introduce a sufficient set of primitives so that any point in the design space may be implemented using our new primitives. The primitives and implementation techniques that we propose reduce the cost of a potentially parallel call to nearly that of a sequential call, without restricting the parallelism inherent in the program. Using our primitives we implement a system (in some cases a previously proposed system) at each point in the design space.

(Similar to tail recursion elimination, some upward threads can be treated as downward threads. If an upward thread is to be joined with one of its ancestors, then before it terminates it must be given a continuation to the ancestor with which it will join. If the parent is a downward thread and creates a child thread as its last operation, it can give the child the parent's own return continuation and then terminate itself. In other words, the parent can wait for its child to complete before returning itself. For this reason we treat the apparently upward child thread as a downward thread.)

Chapter

Multithreaded Abstract Machine

In this chapter we describe the primitives needed to efficiently implement potentially parallel calls on a multithreaded system. We divide these primitives into two classes: storage primitives and control primitives. Storage primitives are used to manage the thread state and to maintain the global cactus stack. Control primitives are used to manage control flow through thread creation, suspension, termination, etc. These primitives support the potentially parallel call without restricting parallelism, yet bring the cost of thread creation and termination down to essentially the cost of the sequential call and return primitives.

As we formalize the multithreaded system, we introduce the abstract primitives that will be used in our implementations. We discuss the thread representation mechanisms: threads, closures, thread seeds, and continuations. We present the two basic methods for disconnecting a lazy thread from its parent and elevating the child to an independent thread: eager-disconnect and lazy-disconnect. We then formalize the four basic storage models that can be used to store thread state: linked frames, multiple stacks, stacklets, and spaghetti stacks. The different queueing methods are discussed in Chapter.

Our exposition proceeds from a basic machine (MAM) to one which efficiently executes potentially parallel calls (LMAM) in four steps. We begin by formalizing a basic multithreaded abstract machine (MAM). We next show how to make forks similar to calls by introducing a direct fork operation (MAM/DF). Next we introduce an abstract machine that makes the termination operation similar to the return mechanism (MAM/DS). Finally, we incorporate more efficient storage management into a lazy multithreaded abstract machine (LMAM). This last abstract machine has all the primitives necessary for efficient potentially parallel calls.

CHAPTER MULTITHREADED ABSTRACT MACHINE

⟨⟩        Used to enclose a tuple.
inst(ip)  The instruction at address ip.
·         Indicates that the field has no valid value.
[...]     Represents a sequence which has been defined only on the values given, but which can expand infinitely (e.g., a stack).
∗         Can be any value of the correct type.
∅         Used for fields with a null value.
M[x]      The contents of memory location x.
r_x       Register number x.
x ← y     x is assigned the value y.
x ≡ y     x is defined to be y.
x = y     x has the value y.

Figure: The syntax of the formal descriptions.

While there are many ways to distribute work among processors, we focus on work stealing. Work stealing is a method of pulling work (i.e., threads) to idle processors. One of the focuses of this work is the representation of threads that are assigned to idle processors. We shall describe two methods of work stealing: continuation stealing and seed activation. Both continuation stealing and seed activation continue work in the parent while its child is still alive. Continuation stealing resumes the parent by resuming the parent's current continuation. Seed activation resumes work generated by the parent; the work is a child thread that would have been created if the parent's continuation had been stolen and the parent immediately executed a fork.

This chapter introduces a substantial amount of new terminology. The reader may want to refer to the glossary in Appendix A. It is suggested that on first reading the formal definitions and rewrite rules be skipped.

The Multithreaded Abstract Machine (MAM)

In this section we define a multithreaded abstract machine (MAM) and the components that make up all multithreaded systems (e.g., threads and processors). In MAM each thread operation is implemented in the most naive and direct manner possible. This will serve as the basis for the refinements required to make multithreading more efficient. It also models the multithreaded systems as implemented in much of the previous work.


Thread ≡ ⟨ip, sp, t_p, σ, s, i_p, l⟩

Where ip  is valid when the thread is not running and represents the address at which to continue in the thread;
      sp  is valid when the thread is not running and represents the current stack pointer for the thread;
      t_p is a pointer to its parent thread. If it has no parent thread (the case with daemon threads), then it is ∅;
      σ   represents the current thread status, {ready, running, idle}. At any instance a thread can be in only one state;
      s   is the thread's stack;
      i_p is the thread's return inlet, an instruction address. It points to the inlet used to return results to the parent. This field combined with t_p comprise the return register. (Inlets are defined in Section.);
      l   is either unlocked or locked. A thread is locked to allow certain operations to be carried out atomically.

Figure: Formal definition of a thread as a tuple. The return register is broken down into its two components, the parent thread and the inlet address (fields three and six, respectively). The state of the thread is included in the tuple. Finally, a lock bit is included.
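A direct C rendering of the thread tuple might look like the following; the field names track the figure, but the concrete types (and the use of raw pointers for the stack and inlet) are assumptions for illustration, not the thesis representation.

```c
#include <assert.h>
#include <stddef.h>

enum status { READY, RUNNING, IDLE };    /* the sigma field */

typedef struct mam_thread {
    void *ip;                /* resume address; valid when not running */
    void *sp;                /* saved stack pointer; valid when not running */
    struct mam_thread *tp;   /* parent thread; NULL for daemon threads */
    enum status st;          /* ready, running, or idle */
    void *stack;             /* the thread's logically unbounded stack */
    void *inlet;             /* return inlet; (tp, inlet) forms the return register */
    int locked;              /* lock bit for atomic thread operations */
} mam_thread;
```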

Threads

In this work a thread is a locus of control that can perform calls to arbitrary nesting depth, suspend at any point, and fork additional threads. Threads are scheduled independently and are nonpreemptive.

Each thread has an associated instruction pointer (ip), stack pointer (sp), stack (s), and return register (t_p and i_p). The instruction pointer points to the next instruction to be executed in the thread. The stack pointer points to the top of the thread's stack and the end of the current activation frame. The return register is used to return results to the parent and contains a continuation to an inlet in the thread's parent.

The general registers are associated with the processor, not the thread. In particular, the general registers used to carry out the computation must be explicitly saved or restored when a thread exits or is entered.

The formal definition of a thread, as shown in Figure, differs from the description in the previous paragraph to facilitate the definition of the operational semantics of the MAM. In particular, each thread has an associated lock flag (l) and state field (σ). The lock flag is used to describe which thread and inlet operations must execute atomically with respect to other operations. The state field is used to describe the scheduling status of a thread: running indicates that the thread is assigned to a processor and currently executing; ready indicates that the thread is on the MAM ready queue and can be assigned to a processor; and idle indicates that the thread is neither running nor ready. These states and thread scheduling transitions are fully described in Section.

Of course, any data from the thread needed for the operation of MAM can be stored in the thread's stack. We represent the thread as a separate tuple in order to make the rewrite semantics easier to understand.

Processor ≡ ⟨ip, sp, t, R⟩

Where ip is the instruction pointer;
      sp is the stack pointer;
      t  is the thread being executed by the processor. If no thread is being executed, then the processor executes the dispatch thread, which is indicated by the ip field having the address wait;
      R  is the set of general-purpose registers {r_0, ..., r_n}.

Figure: Formal definition of a processor as a tuple.

Processors and MAM

A processor is the underlying hardware that threads execute on; it is what the threads are virtualizing. Each processor has an instruction pointer (ip), a stack pointer (sp), a running thread (t), and a set of general-purpose registers (R). The thread currently running on the processor has access to all the registers in R. We represent a processor as a tuple, as shown in Figure. Notice that the processor does not have a stack associated with it, but a stack pointer. The stack, which is pointed to by sp, is the one contained in t.



MAM is composed of a single global address space (M and S) which is accessible by all threads, a set of (possibly only one) processors P, and a ready queue Q. The ready queue contains the set of threads that are ready to execute, i.e., those in the ready state. We represent the MAM formally as a tuple, as shown in Figure.

(Despite its name, the ready queue is not necessarily a queue, and nothing in the semantics of MAM depends on it having any particular structure. It may also be distributed among the processors. We use the terms enqueue and dequeue to mean add to or remove from a data structure, without requiring that the element be added at or removed from a particular location in the data structure. The address space may be composed of distributed or shared memory.)

Machine ≡ ⟨P, T, Q, S, M⟩

Where P is the set of processors;
      T is the set of all threads;
      Q is the ready queue, i.e., the set of threads in the ready state;
      S is the set of currently allocated stacks. Each s ∈ S can grow infinitely; ∪_{s∈S} s ⊆ M, and ∀s, t ∈ S with s ≠ t, s ∩ t = ∅;
      M is the entire global memory, including the heap and all of the stacks.

Figure: Formal definition of the MAM as a tuple.

The code for a program on MAM is divided into two classes: codeblock and inlet. The codeblock code performs the actual work of the threads. The inlet code, described in Section, is the portion of the program that performs inter-thread communication. The program text is divided in this manner to highlight the fact that the codeblock code carries out the computation in the program, while the inlet code is generated by the compiler to handle the communication.

We concern ourselves here with the instructions that affect multithreading. The instructions that may appear in a codeblock are fork, exit, suspend, yield, and send. These instructions create, terminate, suspend, yield, and send data to a thread, respectively. The inlet instructions are recv, enable, and ireturn, which handle incoming communication to a thread. Standard instructions (e.g., arithmetic, logical, memory, control transfer, etc.) may be used in either codeblock or inlet code. (There is no explicit join instruction, as join is synthesized from more basic instructions.)

In order to give the reader an overview of how the individual instructions work together, we now present an example program with a translation for the MAM. The reader is not expected to understand all the details of the individual instructions. Figure shows an example translation of a simple function with two forks and a join. The first lines are


int function() {
    int a,b,c;
    a = fork X();
    b = fork Y();
    join;
    c = a+b;
    return c;
}

 1: function:
 2:   set synch=2
 3:   fork inlet1 = X()
 4:   fork inlet2 = Y()
 5:   suspend
 6:   // enable will restart here.
 7:   add c = a + b
 8:   send tp, ip, c
 9:   exit
10: inlet1:
11:   recv a
12:   add synch = synch - 1
13:   cmp synch, 0
14:   be enab
15:   ireturn
16:
17: inlet2:
18:   recv b
19:   add synch = synch - 1
20:   cmp synch, 0
21:   be enab
22:   ireturn
23:
24: enab:
25:   enable
26:   ireturn

(Lines 1-9 are the codeblock code; lines 10-26 are the inlet code.)

Figure: Example translation of a function with two forks into pseudocode for the MAM. t_p is the parent thread; i_p is the parent inlet.

the codeblock code. They handle all the operations in the user code except for receiving the results from the spawned children. Lines 10-26 are the inlet code, which consists of two inlets. Each handles the result from a child thread and also schedules the thread when appropriate.

The join is accomplished by setting a synchronization variable, synch, to the number of forks in the forkset (in this case, two) in Line 2. When the children return, they decrement the synchronization variable (Lines 12 and 19), and if the variable is zero they enable the thread (Line 25), indicating that the join has succeeded.

The function starts by setting the synchronization variable, and then the forks are performed, followed by a suspend, which causes the runtime system to spawn another thread that is ready to run (i.e., a thread in the ready state). When the two children finish, the codeblock is resumed at Line 6. The thread finishes by sending the result c to an inlet i_p of its parent thread t_p. The parent thread and inlet are both stored in the thread state when the thread is created.


⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, ∗⟩
    and inst(ip) = exit
    and t = ⟨·, ·, t_p, running, s, i, ∗⟩
⟹ ⟨P, T′, Q, S′, M⟩ where p = ⟨wait, ·, ·, ∗⟩
    and T′ = T − {t}
    and S′ = S − {s}

Figure: The thread exit instruction.

Each fork specifies the inlet that will be invoked when the child thread issues the send instruction. For example, inlet1 on Line 10 is executed when the first child (the one executing the function X) finishes. The inlet stores the result into a, as specified in the recv instruction on Line 11. Next the synchronization variable is decremented and tested against zero. If it is not zero, the inlet ends with an ireturn on Line 15. If the synchronization variable is zero, then both threads have finished and the codeblock can continue. The thread is made ready to run by the enable in Line 25.

Thread Operations

In addition to standard sequential operations, there are four thread operations that can be performed by a thread: fork, exit, yield, and suspend. fork creates a new thread. exit terminates a thread and frees all the resources associated with it (see Figure). yield allows a thread to give up its processor by placing itself on the ready queue (see Figure). suspend causes a thread to give up its processor and marks the thread as idle (see Figure). There is no intrinsic join operation, nor is there any intrinsic operation for testing the status of a thread, but both can be synthesized.

fork creates and initializes a new thread (see Figure). fork has two mandatory arguments, the start-address and the inlet-address, and an optional list of parameters that is passed to the new thread. We call the thread executing the instruction the parent thread and the newly created thread the child thread. When fork executes, the child thread's ip is set to the start-address and its sp is set to the base of its newly allocated stack. It is placed on the ready queue, and its parent continuation is set to the return continuation specified

(A more general model, such as TAM, separates thread creation from the passing of arguments. In the more general model, fork can only create the thread; send instructions are then used later to pass the new thread its arguments. TAM requires this more general model.)


⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, ∗⟩
    and inst(ip) = yield
    and t = ⟨·, ·, t_p, running, s, i, ∗⟩
    and t′ ∈ Q
    and t′ = ⟨ip′, sp′, t′_p, ready, s′, i′, ∗⟩
⟹ ⟨P, T, Q′, S, M⟩ where p = ⟨ip′, sp′, t′, ∗⟩
    and t = ⟨ip + 1, sp, t_p, ready, s, i, ∗⟩
    and t′ = ⟨·, ·, t′_p, running, s′, i′, ∗⟩
    and Q′ = Q ∪ {t} − {t′}

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, ∗⟩
    and inst(ip) = yield
    and Q = ∅
⟹ ⟨P, T, Q, S, M⟩ where p = ⟨ip + 1, sp, t, ∗⟩

Figure: The yield instruction. The first rule applies when there is at least one other thread that is ready to run. The second rule applies when there are no other threads in the ready queue; it is effectively a NOP.

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, ∗⟩
    and inst(ip) = suspend
    and t = ⟨·, ·, t_p, running, s, i, ∗⟩
⟹ ⟨P, T, Q, S, M⟩ where p = ⟨wait, ·, ·, ∗⟩
    and t = ⟨ip + 1, sp, t_p, idle, s, i, ∗⟩

Figure: The suspend operation. A new thread is scheduled on the processor by the Idle operation (see Section).

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, R⟩
    and inst(ip) = fork i, adr, arg_1, arg_2, ..., arg_n
    and t = ⟨·, ·, ·, running, ·, ·, ∗⟩
⟹ ⟨P, T′, Q′, S′, M⟩ where p = ⟨ip + 1, sp, t, R⟩
    and T′ = T ∪ {t_new}
    and Q′ = Q ∪ {t_new}
    and S′ = S ∪ {s_new}
    and t_new = ⟨adr, s_new[n], t, ready, s_new, i, ∗⟩
    and s_new = [arg_1, arg_2, ..., arg_n]

Figure: The fork operation for MAM.


in the fork. If there are any arguments passed to the child thread, they are put at the base of the child thread's stack. The sp is set to the first free location in the child thread's stack. The child thread enters the ready state. Finally, control is passed to the next instruction in the parent thread.

In MAM there are no intrinsic synchronization instructions. Instead, synchronization instructions are synthesized from the primitives of MAM. Since we are concentrating on downward threads, the most important synchronization instruction is join, which synchronizes a parent with a set of children that it forked. The parent continues only after all of its children have reached the join, i.e., when they have all completed. To implement this, each time a parent forks a child (or a set of children) it increments a counter variable called a join counter. The inlet that the parent passes to its child decrements the same join counter, making the parent ready with enable if the counter is zero. The join operation is thus a test on the join counter: if it is not zero, then the parent will suspend.
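The join-counter protocol just described can be sketched as ordinary C; here the inlets are plain functions, the enable step is modeled by a flag, and all names are illustrative rather than taken from MAM.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int join_counter;   /* outstanding children in the forkset */
    bool enabled;       /* models enable: parent is ready to resume */
    int a, b;           /* slots the inlets store results into */
} parent_frame;

/* Inlet for the first child: store the result, then decrement the
   join counter; when it reaches zero, the join has succeeded. */
void inlet1(parent_frame *p, int result) {
    p->a = result;
    if (--p->join_counter == 0)
        p->enabled = true;   /* enable */
}

/* Inlet for the second child. */
void inlet2(parent_frame *p, int result) {
    p->b = result;
    if (--p->join_counter == 0)
        p->enabled = true;   /* enable */
}
```

Only the last inlet to run performs the enable, which is exactly the test-on-the-join-counter behavior of the synthesized join.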

Inlets and Inlet Operations

Inlets and their associated instructions generalize the data-transfer portion of the sequential return instruction for threads. A conventional sequential return instruction implicitly performs two tasks: it transfers control from the child to the parent, and it transfers results from the child to the parent in the processor registers. The calling convention specifies which processor registers (maybe just one) contain the results to be returned to the parent. The code executed in the parent by the return instruction (the code following the call that invoked the child that is issuing the return) may use these results immediately, or it may store them in the activation frame for later use. Just as a sequential call has a code fragment that follows it, in MAM each fork is associated with an inlet.

An inlet is a code fragment that processes data received from another thread, which sends the data to the inlet with a send instruction. We call the thread that issues the send the sending thread, and the thread to which the data is sent the destination thread. Because data and control are not transferred simultaneously, the inlet typically has to perform some synchronization task in addition to storing the data in the destination thread.

(This works because inlets run atomically with respect to each other and the codeblock instructions. Of course, inlets may be used to receive data other than results from another thread; they are a general mechanism that enables one thread to communicate with another.)


⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, R⟩
    and inst(ip) = send t′, adr, arg_1, arg_2, ..., arg_n
    and t = ⟨·, ·, ·, running, ·, i, ∗⟩
    and t′ ∈ T
    and t′ = ⟨·, ·, ·, ·, s′, ·, l⟩
    and l = unlocked
    and inst(adr) = recv slot_1, slot_2, ..., slot_n
⟹ ⟨P, T, Q, S, M′⟩ where p = ⟨adr + 1, sp, t′, R⟩
    and s′[slot_x] ← arg_x, ∀x, 1 ≤ x ≤ n
    and M[sp] ← ip + 1
    and M[sp + 1] ← t
    and l ← locked

Figure: The send instruction.

The inlet is run in the sender's thread, but it affects the destination thread's state. In other words, a new thread of control is not started for the inlet. Although an inlet runs in the sender's thread, and therefore on the sender's processor, it can be viewed as preempting the destination thread. Thus thread operations like fork cannot appear in an inlet. Inlets run atomically with respect to the destination thread and other inlets.

MAM defines four instructions that are specifically aimed at inlets: send, recv, ireturn, and enable. send transfers values to another thread. recv is always the first instruction in an inlet; it indicates where the data values specified in the corresponding send are stored in the destination thread's frame. ireturn is used to return control from the inlet routine back to the sending thread. Finally, enable places an idle thread in the ready queue if it is not already there.

Typically, a thread in MAM ends with a send followed by an exit. The send transfers the data from the child to its parent. This invokes an inlet, which stores the data from the child into the parent frame and then performs a synchronization operation. If the inlet determines that the parent should be placed on the ready queue, then it executes an enable. The inlet ends with an ireturn, which transfers control back to the sender. The exit then terminates the child thread.
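On a shared-memory implementation, the send/inlet/ireturn sequence can be viewed as a locked call into the destination thread, as in the sketch below. This is an assumed rendering for illustration, not the MAM primitives themselves; the struct fields and function names are hypothetical.

```c
#include <assert.h>

typedef struct thr thr;
typedef void (*inlet_fn)(thr *dest, int value);

struct thr {
    int locked;         /* models the lock field l of the thread tuple */
    int result_slot;    /* frame slot the recv stores into */
    int join_counter;   /* synchronization state updated by the inlet */
};

/* send: lock the destination, run the inlet on the sender's processor,
   then unlock (the ireturn) so control returns to the sender. */
void send(thr *dest, inlet_fn inlet, int value) {
    dest->locked = 1;        /* inlets run atomically w.r.t. dest */
    inlet(dest, value);
    dest->locked = 0;        /* ireturn */
}

/* An inlet: store the data (the recv), then do the synchronization work. */
void example_inlet(thr *dest, int value) {
    dest->result_slot = value;
    dest->join_counter--;
}
```

Note that no new thread of control is created: the inlet is just code executed on the sender's processor against the destination's state.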

send specifies the destination thread, the inlet to execute, and the data to be sent to the destination thread. As shown in Figure, a send must be matched with a recv at the inlet address. recv and send must have the same number and types of arguments.


⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, R⟩
    and inst(ip) = ireturn
    and t = ⟨·, ·, ·, ·, ·, ·, l⟩
    and l = locked
    and M[sp] = ip′
    and M[sp + 1] = t′
⟹ ⟨P, T, Q, S, M⟩ where p = ⟨ip′, sp, t′, R⟩
    and l ← unlocked

Figure: The ireturn instruction.

In other words, a send/recv pair is used by threads, in place of the processor registers used by sequential return, to transfer data from the child to the parent. send carries out several operations atomically. First, it locks the destination thread to prevent other inlets and codeblock instructions from executing for the destination thread. It then saves the sender's thread and ip on the sender's stack. Next, it stores the arguments of send in the appropriate slots of the destination thread's stack, as specified by recv. Finally, it sets the sender processor's ip to the instruction following the recv. send may appear only in a codeblock.

After recv completes, the inlet continues to execute instructions, which may access the destination thread's state. Among these may be enable (see Figure) and ireturn. ireturn returns control from the inlet back to the sender's thread, unlocking the destination thread in the process (see Figure). ireturn may only appear in an inlet.

Thread Scheduling

During the life of a thread it may be in one of three states: ready, running, or idle. A ready thread is one that the scheduler may run when there is a processor available. A running thread is one that is currently assigned to a processor. An idle thread is waiting on some event to become ready. The state transitions for a thread are shown in Figure.

(For readers familiar with active messages, the setting of a lock may be confusing. The use of a lock is strictly to facilitate the definition of the rewrite rules. In fact, the lock is implicit on networked machines, because inlets by construction run atomically with respect to threads and other inlets.

To avoid unnecessary complexity, without loss of generality in the definition of MAM, we disallow inlets from issuing send instructions. In particular, if inlets are allowed to issue sends, then some mechanism is needed to prevent deadlock. See for a discussion on allowing inlets to issue sends and how it relates to distributed memory machines.)


[Figure: state diagram over the three states Ready, Running, and Idle. Transitions: "Created by parent" enters Ready; "Selected by scheduler" takes Ready to Running; "Yield" takes Running to Ready; "Suspend" takes Running to Idle; "Enabled by inlet" takes Idle to Ready; "Returned to parent" leaves Running.]

Figure: The legal state transitions for a thread in MAM.

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, R⟩
    and inst(ip) = enable
    and t = ⟨ip′, sp′, t_p, σ, s, i, ∗⟩
⟹ ⟨P, T, Q′, S, M⟩ where p = ⟨ip + 1, sp, t, R⟩
    and t = ⟨ip′, sp′, t_p, σ′, s, i, ∗⟩
    and σ′ = ready if σ = idle; σ otherwise
    and Q′ = Q ∪ {t} if σ = idle; Q otherwise

Figure: The enable instruction, which is always executed in an inlet. This instruction is a NOP if t is not in the idle state.

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨wait, ·, ·, ∗⟩
    and t ∈ Q
    and t = ⟨ip, sp, t_p, ready, s, i, ∗⟩
⟹ ⟨P, T, Q′, S, M⟩ where p = ⟨ip, sp, t, ∗⟩
    and t = ⟨·, ·, t_p, running, s, i, ∗⟩
    and Q′ = Q − {t}

Figure: Describes how an idle processor gets work from the ready queue.


When an idle thread becomes enabled (i.e., when some thread sends a message to one of its inlets, which executes enable), it becomes a ready thread by being placed on the ready queue (see Figure). When a processor is idle, it takes work by retrieving a thread from the ready queue (see Figure). The processor's ip and sp are then loaded with the thread's ip and sp. The processor continues to execute instructions from its running thread until the thread suspends (becomes idle), yields (becomes ready, placing itself on the ready queue), or exits. When the thread exits, it frees its stack and ceases to exist.
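The scheduling transitions above can be modeled by a small dispatch loop over the ready queue. The sketch below deliberately ignores stacks and registers, and uses a LIFO array for the queue, which the semantics permit since no particular order is required; all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum state { READY, RUNNING, IDLE };

typedef struct {
    enum state st;
} mam_thread;

/* The ready "queue"; MAM does not require any particular order. */
#define MAXQ 8
static mam_thread *queue[MAXQ];
static int qlen = 0;

void make_ready(mam_thread *t) {    /* enqueue a ready thread */
    t->st = READY;
    queue[qlen++] = t;
}

/* An idle processor dequeues any ready thread and runs it;
   returns NULL when the processor must stay in its wait loop. */
mam_thread *dispatch(void) {
    if (qlen == 0)
        return NULL;
    mam_thread *t = queue[--qlen];
    t->st = RUNNING;
    return t;
}

void suspend(mam_thread *t) { t->st = IDLE; }   /* wait on an event */

void enable(mam_thread *t) {    /* NOP unless the thread is idle */
    if (t->st == IDLE)
        make_ready(t);
}
```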

Discussion

The abstract machine described here is the basis for most multithreaded systems that exist today. It can be implemented on a physical machine in many ways. In one such implementation, each thread is described by a data structure that contains the three special-purpose registers (ip, sp, and parent) and a pointer to a stack. A thread is mapped onto a processor by loading its ip and sp into the processor's ip and sp and, if necessary, loading the registers from the register save area. Before a thread gives up the processor, the registers are saved in the register save area, and likewise the ip and sp are saved. The fork operation creates a thread data structure, allocates a new stack, initializes the registers, and places the thread on the ready queue. The ready queue can be implemented as a queue, bag, or stack; the choice of ready queue implementation is beyond the scope of this thesis. The other thread operations are easily implemented.
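The naive implementation above can be sketched concretely. The struct layout, field names, FIFO discipline, and stack size below are illustrative assumptions for this sketch, not the thesis's actual data structures:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical thread descriptor for the naive MAM implementation:
   the three special-purpose registers, a private stack, a register
   save area, and a link for the ready queue. */
typedef struct thread {
    void  *ip;             /* saved instruction pointer */
    char  *sp;             /* saved stack pointer */
    struct thread *parent; /* special-purpose parent register */
    char  *stack_base;     /* base of this thread's private stack */
    char   regs[64];       /* register save area */
    struct thread *next;   /* link for the ready queue */
} thread_t;

/* The ready queue here is a FIFO; as the text notes, a bag or a
   stack would satisfy the abstract machine equally well. */
typedef struct { thread_t *head, *tail; } ready_queue_t;

static void enqueue(ready_queue_t *q, thread_t *t) {
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

static thread_t *dequeue(ready_queue_t *q) {
    thread_t *t = q->head;
    if (t) { q->head = t->next; if (!q->head) q->tail = NULL; }
    return t;
}

/* fork in the naive implementation: allocate a stack and a
   descriptor, then place the new thread on the ready queue. */
static thread_t *naive_fork(ready_queue_t *q, void *entry_ip,
                            thread_t *parent, size_t stack_size) {
    thread_t *t = malloc(sizeof *t);
    t->stack_base = malloc(stack_size);
    t->ip = entry_ip;
    t->sp = t->stack_base + stack_size;  /* stack grows down */
    t->parent = parent;
    enqueue(q, t);
    return t;
}
```

Note that every potentially parallel call pays for two allocations and a queue operation here, which is exactly the cost the lazy-thread machinery of the following sections avoids.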

This naive implementation of MAM satisfies the requirements of a multithreaded system, but it is more costly than necessary, since it treats every potentially parallel call as a fork of an independent thread. There are four areas of potential inefficiency:

- There may be many unnecessary enqueue and dequeue operations. For example, when a parent forks a thread and immediately executes a join operation, the child is put on the ready queue and the parent suspends, whereupon the child is pulled off the ready queue and finally executed. Logically, however, the child could have been run immediately.

- The processor registers are not used for transferring data (arguments on call and results on return) between parent and child. This inefficiency arises because MAM separates the transfer of data (the arguments in a fork or the results in a send) from the transfer of control (the actual execution of the child or parent).

- The registers may be unnecessarily saved and restored. For example, the system loads registers when the thread starts, even though none are active. (The description does not mandate a particular register save policy.)

- The thread may not need an entire stack.

Sequential Call and Return

Before considering a more efficient abstract machine, let us review sequential call and return. Observe that the efficiency of a sequential call derives from two cooperating factors. First, the parent is suspended upon call, and all of its descendants have completed upon return. Second, data and control are transferred together on call and return. The first condition implies that storage allocation for the activation frames involves only adjusting the stack pointer. The second condition means that arguments and return values can be passed in registers and that no explicit synchronization is required.

When a sequential call is executed, it passes its child a return address, which is a continuation to the remaining work in the parent. In fact, since the parent cannot continue until its child returns, the return address is a continuation to the rest of the program.

When a sequential return is executed, it returns control to its parent through the return address. We can easily change the destination address because it is stored in memory. This indirect jump provides the only significant flexibility in the sequential call-and-return sequence, and we exploit it to realize a fork, and later a thread return, which behaves like a sequential call and return.
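The indirect jump through a stored return address is the hook the rest of the chapter exploits. As a rough illustration, the sketch below models the return-address slot as an explicit function pointer in memory rather than a hardware stack slot; redirecting the "return" is then just storing a different code pointer before the child finishes:

```c
#include <assert.h>

/* A continuation that receives the child's result. */
typedef void (*cont_t)(int result);

static int observed;
static void normal_return(int r) { observed = r; }   /* original caller */
static void stolen_return(int r) { observed = -r; }  /* redirected target */

/* The "child": finishes and jumps indirectly through the stored
   return continuation, exactly as a return instruction would. */
static void child(int x, cont_t *ret_slot) {
    (*ret_slot)(x + 1);  /* indirect jump through memory */
}

static int run(int redirect) {
    cont_t ret = normal_return;
    if (redirect) ret = stolen_return;  /* overwrite the return address */
    child(41, &ret);
    return observed;
}
```

Here run(0) yields 42 through the normal path, while run(1) shows the same child delivering its result to a different destination, with no change to the child itself.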

MAM-DF: Supporting Direct Fork

In this section we present a multithreaded abstract machine with direct fork (MAM-DF), which eliminates two of the inefficiencies present in MAM when a child runs to completion. First, we eliminate extra scheduling (enqueueing and dequeueing) operations by directly scheduling the child on invocation and directly scheduling the parent when the child terminates. Second, we transfer data and control simultaneously to the child, so that we can use registers when a parent invokes a child. The key difference between MAM and MAM-DF is that MAM-DF introduces a new way to create a thread: the dfork instruction. A dfork does not create an independent thread, but rather a lazy thread. In this section we define lazy threads and introduce the three possible new representations for threads: continuations, thread seeds, and closures.

The MAM-DF dfork behaves more like a sequential call, transferring control directly to its child on call and receiving control when the child exits. When a parent forks a child, instead of creating a new independent thread for the child and placing it on the ready queue, it places itself on a parent queue and immediately begins execution of the child, in a lazy thread, on its processor. When a child completes without suspending, it returns control of the processor to its parent, which continues execution at the instruction after the fork, just as if the fork had been a sequential call.

MAM-DF keeps MAM's fork and exit instructions and adds dfork and dexit, the direct-scheduled versions of fork and exit. fork and exit are used to create and terminate upward and daemon threads. dfork and dexit are used for downward threads: dfork implements a potentially parallel call, and dexit terminates it.

Threads created by dfork are called lazy threads. A lazy thread is not independent on creation, but can become independent in one of three ways. First, it may suspend. Second, it may yield. Third, an idle processor may choose to steal the parent, running the parent concurrently with the child. In all three cases, the relationship between the parent and child changes so that both may run concurrently as independent threads. When a lazy thread becomes independent, it, its parent, or both must be modified to reflect the fact that the child will no longer be returning to a waiting parent.

When a child thread is a lazy thread, we say that it is connected to its parent. If, through a suspend, yield, or steal operation, it becomes an independent thread, we say that it is disconnected. A thread started by fork is also an independent thread. In the rest of the dissertation we use the terms "independent thread" and "lazy thread" when the context is not sufficient to disambiguate our use of the word "thread."

Even though dfork transfers control directly to its child, we must allow for the fact that control may return to the parent before the child has completed; e.g., the child may suspend on a synchronization event. Thus MAM-DF must be flexible enough to start a lazy thread which is expected to run to completion but instead suspends before doing so.


Machine = <P, T, Q, Z, S, M>, where Z is the parent queue.
Thread = <ip, sp, t_p, sigma, s, i, f, l>, where sigma represents the current thread status, sigma in {ready, running, idle, ischild, haschild}, and f is a flag, either true or false, that indicates whether the thread is connected to its parent (true) or is an independent thread (false).

Figure: Redefinition of the machine and thread for MAM-DF. All elements not explicitly defined remain the same as in MAM (see Figure).

To handle this, MAM-DF has a parent queue in addition to the ready queue. Furthermore, threads have an additional field, the connected flag, and can take on two additional states, ischild and haschild. See Figure for a description of the changes to the Machine and Thread tuples.

In the sections that follow we describe the new operations in MAM-DF and the significant changes to the original MAM operations, outline the new scheduling methodology, and then address thread representations and new methods of work stealing.

The MAM-DF Scheduler

In MAM-DF, threads may be scheduled by other threads or by the general scheduler. There are two queues of threads that are ready to run: a ready queue and a parent queue. The ready queue, as in MAM, holds threads in the ready state. These are scheduled the same way they are scheduled in MAM, using a rule similar to the MAM rule in Figure, which assigns a thread in the ready queue to an idle processor. Threads in the parent queue are all in the haschild state and are connected to the child they last invoked with a dfork. Threads in the parent queue are scheduled either directly, by an operation (suspend, yield, or dexit), or through work stealing.

(The parent queue, despite its name, need not be a queue; nothing in the semantics of MAM-DF requires it to have any particular structure. The change to the MAM rule is to add the connected flag to the Thread tuples.)


[State-transition diagram: states Has-Child, Running, Is-Child, Idle, and Ready, each annotated with the value of the connected flag; transitions are labeled "creates child (dfork)", "child exited / returned to parent (dexit)", "suspend", "yield", "created by parent (dfork)", "created by parent (fork)", "enabled by inlet", and "scheduled by general scheduler".]

Figure: The legal state transitions for a thread in MAM-DF without work stealing. Each node in the graph represents a valid state and the value of the connected flag. The solid lines indicate transitions due to the execution of an instruction. The dashed line indicates a transition caused by an outside influence, in this case the general scheduler.

Figure shows the state transitions for a thread. Only threads in the ischild and running states are actually assigned to processors and executing. Threads in the haschild state are on the parent queue, while threads in the ready state are on the ready queue. Essentially, MAM-DF separates the running state of MAM into two states (running and ischild) and the ready state of MAM into three states (ready, haschild with connected flag true, and haschild with connected flag false).

If a child is created by dfork then, unless it suspends or yields, it will remain on the left of the diagram, continuously moving between ischild and haschild. Once a thread becomes independent, it moves to one of the states on the right of the diagram. Only a thread in running or ischild can return to its parent, because only threads in these states are actually executing on a processor.

Thread Operations in MAM-DF

In MAM-DF, fork creates a new independent thread, just as it did in MAM. However, as shown in Figure, the connected flag is set to false to indicate that the thread is independent and will not return control directly to its parent. We sometimes refer to fork as an eager fork, since it always creates an independent thread.


Figure: The fork operation under MAM-DF. When a processor p executing thread t encounters fork i, adr, arg_1 ... arg_n, a new thread t_new is created in the ready state with its connected flag set to false; a new stack s_new is allocated and the arguments are stored on it; and t_new is added to the thread set T, the ready queue Q, and the stack set S. The forking thread continues at the instruction after the fork.

Figure: The dfork operation under MAM-DF transfers control directly to the forked child. When a processor p executing thread t (in state running or ischild) encounters dfork i, adr, arg_1 ... arg_n, a new thread t_new is created in the ischild state with its connected flag set to true and a new stack s_new; the arguments are loaded into the processor registers (r_x = arg_x for x = 1 ... n); the parent t is placed on the parent queue Z in the haschild state; and the processor begins executing t_new at adr. T, S, and Z are extended accordingly.

The new dfork operation, shown in Figure, creates, and then transfers control directly to, a new child: a lazy thread. It does so by loading the arguments of the fork into the processor registers, saving a pointer to the parent thread on the parent queue, and then assigning the processor to the newly created child thread. The state of the new thread is ischild and its connected flag is set to true, indicating that it is connected to, and can return control directly to, its parent. The state of the parent becomes haschild, indicating that it is no longer executing and is on the parent queue.

This abstract machine assumes a caller-save protocol for saving registers. It is assumed that before the dfork is executed there were instructions that saved the active registers in the parent's stack. Upon return, the registers will be restored. It should be noted that, in order for the definition of this machine to make sense, there must be significant compiler interaction.


Figure: The dexit operation under MAM-DF transfers control directly to the parent. The first rule applies when the child completes without suspending, i.e., while still in the ischild state: the child t (connected, ischild) is terminated, its stack is removed from S and the thread from T, its parent t_p is removed from the parent queue Z, and the processor resumes t_p at the parent's saved ip and sp; the parent's new state is ischild if its own connected flag is true, and running otherwise. The second rule applies when the thread has been disconnected from its parent: the thread (running, with connected flag false) is terminated, its resources are freed, and the processor becomes idle, waiting for the general scheduler.

For example, a thread started by fork puts its arguments on the stack, while a thread started by dfork puts them in registers.

The dexit operation exits the thread and returns control from a child thread to its parent (see Figure). If the child thread is in the ischild state (the parent has to be in the haschild state), it is terminated and the parent thread is restarted on the same processor without any intervening operations by the general scheduler. The parent thread returns to the state it was in before the dfork of the child. This state is either running or ischild, depending upon the state of the parent's connected flag. If the child thread is in the running state, dexit behaves like exit: it deallocates the thread's resources, but invokes the general scheduler on the processor.

(Recall that in MAM-DF, control and data are passed simultaneously only on call. On return, data is returned with a send and control with dexit.)
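The dfork/dexit fast path can be sketched as a simulation of the state transitions alone (no real stacks, registers, or instruction pointers); the type and function names below are invented for illustration:

```c
#include <assert.h>

enum st { RUNNING, IS_CHILD, HAS_CHILD };

typedef struct lthread {
    enum st state;
    int connected;           /* true while the child can return directly */
    struct lthread *parent;
} lthread;

#define PQ_MAX 16
static lthread *parent_queue[PQ_MAX];
static int pq_top;

/* dfork: the parent places itself on the parent queue and the child
   starts immediately, connected, on the same processor. */
static void dfork(lthread *parent, lthread *child) {
    parent->state = HAS_CHILD;
    parent_queue[pq_top++] = parent;
    child->state = IS_CHILD;
    child->connected = 1;
    child->parent = parent;
}

/* dexit for a connected child: pop the parent and resume it directly,
   with no trip through the general scheduler.  The parent returns to
   the state it had before the dfork.  Returns NULL for a disconnected
   child, which must instead invoke the general scheduler. */
static lthread *dexit(lthread *child) {
    if (child->state == IS_CHILD && child->connected) {
        lthread *p = parent_queue[--pq_top];
        p->state = p->connected ? IS_CHILD : RUNNING;
        return p;
    }
    return 0;
}
```

The point of the sketch is the fast path: a dfork followed by a dexit is a push and a pop on the parent queue, with control flowing exactly as in a sequential call.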


Figure: The suspend operation under MAM-DF for a lazy thread. When a connected child t in the ischild state executes suspend, its connected flag is set to false and its state becomes idle; its parent t_p is removed from the parent queue Z and resumed on the processor at the parent's saved ip and sp, in state ischild if the parent's own connected flag is true and running otherwise.

suspend and yield also behave differently depending upon whether the thread executing them is in the ischild or running state. If the thread is in the running state, the MAM versions of these operations apply (see Figures). If, however, the thread is in the ischild state, the child and parent threads must be disconnected. After a lazy thread suspends or yields, it becomes independent of its parent, and the parent may execute another dfork. Since a parent cannot have two children that expect to schedule the parent directly when they return, the suspending child is changed so that when it returns it will not schedule the parent directly. This change is indicated by setting the connected flag to false (see Figure).
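A minimal sketch of the disconnect performed when a lazy thread suspends, again as state bookkeeping only (names are invented; the saving of the child's ip and sp is left implicit):

```c
#include <assert.h>

enum { LT_RUNNING, LT_IS_CHILD, LT_HAS_CHILD, LT_IDLE };

typedef struct { int state; int connected; } lt;

/* suspend on a connected child: sever the link first, so a later
   dexit by this child will not try to resume the parent, then mark
   the child idle and resume the parent in its pre-dfork state. */
static void lazy_suspend(lt *child, lt *parent) {
    child->connected = 0;    /* child is now an independent thread */
    child->state = LT_IDLE;  /* waits until an inlet executes enable */
    parent->state = parent->connected ? LT_IS_CHILD : LT_RUNNING;
}
```

After this, the parent is free to dfork another child, which is precisely why the connected flag must be cleared before the parent resumes.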

Continuation Stealing

Continuation stealing is one of two methods by which a thread in the parent queue can resume execution before its child completes. (The other, seed activation, is described in the next section.) Continuation stealing resumes execution of a parent at its current continuation and breaks the connection between a parent and its child. In other words, it turns the child into an independent thread, removes the parent from the parent queue, and then resumes the parent at its current ip. Figure shows the state transitions for MAM-DF when continuation stealing is used to effect work stealing.

The continuation that is stolen is the same one that the child would have used had it returned to the parent with dexit. This continuation exists because the child was started with dfork, which puts a pointer to the parent on the parent queue.


[State-transition diagram: as in the earlier figure, with additional "parent stolen" transitions out of the two Has-Child states.]

Figure: The new state transitions for a thread in MAM-DF using continuation stealing. Here we highlight the new transitions introduced when a thread's continuation is stolen.

Figure: The rule for stealing work using continuation stealing when the child of the stolen thread is not currently running, i.e., the child is in the haschild state. An idle processor removes a thread t from the parent queue Z, disconnects t's child t_c (its connected flag becomes false, making it an independent thread), and resumes t at its saved ip and sp, in state ischild if t's own connected flag is true and running otherwise.

The ip and sp fields in the parent thread are the continuation. If the child returns after the parent is stolen, it exits without scheduling the parent, because it is now an independent thread.

Figure shows the semantics of a continuation-stealing operation when the child of the stolen thread is not currently running on the processor. The resulting configuration is very similar to the configuration of a parent after a child has suspended. The effect of continuation stealing is similar to that of a child suspending, i.e., the parent migrates. The main difference is that the child remains on its processor, while the parent and its ancestors will now be scheduled on what was the idle processor.
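The stealing rule's state changes can be sketched as follows (bookkeeping only, with the saved continuation itself left implicit; names are invented):

```c
#include <assert.h>

enum { T_RUNNING, T_IS_CHILD, T_HAS_CHILD };

typedef struct thr {
    int state;
    int connected;
    struct thr *last_child;  /* the child it last invoked with dfork */
} thr;

/* Continuation stealing: disconnect the child, then resume the
   parent at its saved ip/sp on the idle processor.  Returns the
   thread the idle processor should now run. */
static thr *steal_continuation(thr *parent) {
    parent->last_child->connected = 0;  /* child is now independent */
    parent->state = parent->connected ? T_IS_CHILD : T_RUNNING;
    return parent;
}
```

As in the rule above, the child is untouched apart from its connected flag: it keeps its processor, while the parent migrates to the formerly idle one.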


    case (a)           case (b)           case (c)
    dfork foo()        dfork foo()        dfork foo()
    sequential code    dfork bar()        join
    ...                ...
    join

Figure: Three possible resumption points after executing dfork foo().

Any thread in the parent queue can be scheduled using continuation stealing. However, not all of them have useful work to perform. There are three different situations, shown in Figure, that the parent can be in when it is stolen. In the first two cases the parent will execute useful work, either by running the sequential code, as in case (a), or by creating a new thread, as in case (b). In case (c), however, it will suspend immediately after being scheduled, because the join fails.

Thread Seeds and Seed Activation

A thread seed represents the work in a parent thread that can be forked off into a new thread. When a parent forks a child, it leaves behind a thread seed, which is a continuation up to the next fork in the parent. Seed activation is used by a processor looking for work, i.e., an idle processor. The idle processor finds a parent in the parent queue, which means that the parent has a descendant that is currently executing on a processor. The processor running the parent's descendant is interrupted and begins to activate the thread seed, i.e., it executes the continuation previously left behind in the parent, which forks the next child in the parent. This new child is picked up by the idle processor, the one that initiated the seed activation. In other words, seed activation is a method of making progress in the program by causing a thread in the parent queue to create a new thread. The new thread can then be stolen by another processor. As described below, seed activation uses thread seeds to instantiate nascent threads. A thread seed is the first part of a continuation to the rest of the program: it is the part that spawns a new thread.

Nascent Threads

A nascent thread is a thread that is ready to start execution but has not yet done so. Nascent threads exist because dfork causes control to be transferred from the parent even when the parent has more work that it can perform (the instructions at its return continuation).


    thread t:                 a_steal: // steal fragment following 1st dfork
       dfork foo()               fork bar()
    a: dfork bar()               sreturn
    b: dfork baz()            b_steal: // steal fragment following 2nd dfork
       join                      fork baz()
                                 sreturn
    find(a) = a_steal
    find(b) = b_steal

Figure: Example thread seed and steal code fragments associated with dforks. The thread seed shown is simply the first two fields of the thread tuple (the context pointer and the main code pointer) after thread t executes dfork foo(). There is no seed for the last dfork. Note that the code, if any, represented by the ellipsis is copied to the steal fragments.

Informally, threads that would have been created in MAM, but have not yet been created in MAM-DF, are nascent threads. Formally, a nascent thread is a thread in the logical task graph which is a younger sibling of a thread started by dfork. For example, dfork bar() and dfork baz() are nascent threads after the execution of dfork foo() in Figure.

Thread Seeds

A thread seed consists of a continuation to the rest of the program and one or more partial continuations. A partial continuation is a continuation not to the rest of the program, but only to the portion of the program up to the next fork. It is given by a pointer to a context (i.e., an activation frame) and a set of related code pointers, where each of the code pointers can be derived at link time from a single code pointer. The code pointer used to derive the other code pointers in the seed is the main code pointer. The main code pointer is a continuation to the rest of the program. The derived code pointers are partial continuations that point to code that is a copy of the code at the main code pointer, up to and including the main code pointer. In other words, a thread seed can be viewed as a set of continuations, where each continuation in the set has the same context and each of the code pointers in the set can be derived from the main code pointer. Or, more concretely, it is a set of entry points in a function, each of which performs a different but related task.

(A thread s is a sibling of thread t if s and t have the same parent; s is a younger sibling of t if t was created before s.)


In MAM-DF, the thread seed is represented by a pointer to a thread. The thread's sp is the context, and the thread's ip is the main code pointer from which all the other code pointers in the thread seed can be derived. Figure shows the thread tuple and the portion of it that is the thread seed that represents the nascent thread that will be created by dfork bar().

There is a thread seed for every dfork that has at least one dfork between it and its associated join (see Figure). Each thread seed in MAM-DF has two code pointers. The main code pointer points to the instruction following its dfork. This is the ip that the parent is set to after executing the dfork associated with the thread seed. Thus the main code pointer for a thread seed is the same instruction pointer that would be saved by a sequential call instruction. In fact, the creation of a thread seed at runtime is nothing more than saving the return address of the call to the child thread. The second code pointer points to a code fragment for seed activation, called the steal code fragment. The steal code fragment is used to create the new work upon a work-stealing request.

Figure has a section of a codeblock which includes the steal fragments for the thread seeds associated with the first and second forks. The function find, which is computed at compile or link time, returns the derived code pointers from the main code pointer. This function can be as simple as address arithmetic.
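A thread seed and an address-arithmetic find might look like the sketch below. The fixed STEAL_OFFSET layout is an invented assumption for illustration; a real compiler would arrange whatever layout lets the steal fragment be located from the return address:

```c
#include <assert.h>

/* A thread seed: the first two fields of the thread tuple. */
typedef struct {
    void *context;        /* the parent's activation frame (its sp) */
    char *main_code_ptr;  /* continuation to the rest of the program */
} thread_seed;

/* Hypothetical fixed code layout: each steal fragment is emitted a
   constant distance past its dfork's return address, so deriving the
   second code pointer is a single addition. */
enum { STEAL_OFFSET = 64 };

static char *find_steal(const thread_seed *s) {
    return s->main_code_ptr + STEAL_OFFSET;
}
```

With such a layout, creating a seed at runtime really is just saving a return address; the derived pointers cost nothing until a steal occurs.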

Seed Activation

The key characteristic of seed activation is that a nascent thread is elevated into an independent thread in the context of its parent. This contrasts with continuation stealing, where the parent itself is resumed. With continuation stealing the parent is migrated, as opposed to seed activation, where the new thread is migrated.

Seed activation (see Figure) takes several steps. We describe the procedure with reference to Figures. First, a thread (t_p in Figure) is chosen and temporarily removed from the parent queue. The thread then begins executing, not at its current ip, but at the seed routine pointed to by the second code pointer, computed by find(ip) in Figure. The seed routine starts at the steal fragment in Figure. This is the machine state immediately after work stealing occurs in Figure. The processor that executes the seed routine is the one currently executing the lazy thread descended from the thread that owns the seed. In the example, processor p is executing t_X, the first child of t_p.


Figure: Work stealing through seed activation. An idle processor selects a thread t from the parent queue Z whose descendant is currently executing (in the ischild state) on some running processor. The running processor is interrupted: its current ip and thread are saved on the stack, and it begins executing t's steal code fragment, seed(ip), in the context of t, which remains in the haschild state.

    int function()
    {
      int a,b,c;
      a = fork X();
      b = fork Y();
      join;
      c = a+b;
      return c;
    }

    Codeblock code:
    1:  function:
    2:    set synch=2
    3:    dfork inlet1 = X()
    3a:   // code, if any, between dforks
    4:    dfork inlet2 = Y()
    5a:   cmp synch,0
    5b:   be continue
    5c:   suspend
    6:  continue:  // enable will restart here.
    7:    add c = a + b
    8:    send tq, iq, c
    9:    dexit

    27: X_steal:
    27a:  // duplicate of code in 3a
    28:   fork inlet2 = Y()
    29:   sreturn continue

    find(4) = 27

    Inlet code:
    10: inlet1:
    11:   recv a
    12:   add synch = synch - 1
    13:   cmp synch, 0
    14:   be enab
    15:   ireturn
    17: inlet2:
    18:   recv b
    19:   add synch = synch - 1
    20:   cmp synch, 0
    21:   be enab
    22:   ireturn
    24: enab:
    25:   enable
    26:   ireturn

Figure: Example translation of a function with two forks into pseudocode for MAM-DF using thread seeds. t_q is the parent thread; i_q is the parent inlet.


[Four MAM-DF machine-state snapshots, showing the processor p and the thread tuples for t_p, t_X, and (in the later states) t_new as the steal proceeds.]

Figure: Example of seed activation, assuming t_X is executing function X on processor p. The numbers in the first field of each thread tuple represent the instruction pointer, keyed to the line numbers in Figure. The first state occurs right before work stealing happens; the second is right after work stealing occurs (here find returns the X_steal fragment); the third is after the fork occurs; the last is after the sreturn has executed.

Figure: The seed return instruction (sreturn) is used to complete a seed routine. When the interrupted processor executes sreturn ip_new, the parent thread t is returned to the parent queue Z with its ip updated to ip_new (the instruction following the dfork associated with the seed routine), and the processor resumes the interrupted child at the ip and thread previously saved on the stack.


In other words, the idle processor causes a running processor to be interrupted. The seed routine executes and forks off a new child, which is mapped to the idle processor through the MAM idle rule (see Figure). Finally, the sreturn at the end of the seed routine executes, which returns the parent thread to the parent queue and sets the parent thread's ip to the instruction following the dfork associated with the seed routine that was executed.

The seed routine finishes by executing an sreturn (see Figure). sreturn returns control of the processor to the descendant that was interrupted, and also returns the thread to the parent queue, updating the ip of the thread to reflect the fact that it has made progress. When the parent is next resumed, either by its child (through dexit, suspend, or yield) or by another steal request, it will continue execution at the point after the dfork of the nascent thread it just activated.

Seed activation requires support from the compiler in two ways. First, the compiler must construct the seed routines and the find function for every dfork. Second, it must create two entry points for every thread: one for dfork and one for fork. Because fork creates the child, stores the arguments, and then continues in the parent, the fork entry point loads the arguments from the stack into registers and then continues at the dfork entry point, which assumes that the arguments are in registers. Two entry points are necessary because a routine which will be dforked in the main computation path may be forked from a seed routine.

Closures

The main drawback to thread seeds, as compared to continuation stealing, arises when a dfork is followed by sequential code (case (a) in Figure). The sequential code cannot be represented by a thread seed, so if a dfork starts the first thread, the total amount of parallelism available is reduced. In order to solve this problem, we introduce another representation for a thread: the closure.

We introduce a new fork operation, cfork, that creates a closure, which becomes an independent thread when executed. The closure contains the necessary data (the instruction pointer and arguments specified in the fork instruction) to start the thread later. Closures are enqueued on a separate closure queue.

If a single fork is followed by sequential code which can run in parallel with the forked child, we use cfork to create a closure instead of creating a lazy thread or an


independent thread. When the join is reached, one of two things happens to the closure created for the child. The closure created by the cfork may have been stolen by another processor, in which case the join fails or succeeds depending on the child's completion. Otherwise, the closure is still awaiting execution, in which case the join fails and the closure is instantiated.
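A closure and cfork might be sketched as below. The closure-queue layout and the fixed two-argument signature are illustrative assumptions; a real implementation would store arbitrary arguments alongside the code pointer:

```c
#include <assert.h>

/* A closure: enough data (code pointer plus arguments) to start the
   thread later, as created by cfork. */
typedef struct {
    int (*fn)(int, int);  /* entry point specified by cfork */
    int args[2];          /* saved arguments */
} closure;

#define CQ_MAX 8
static closure cq[CQ_MAX];  /* the separate closure queue */
static int cq_len;

static void cfork(int (*fn)(int, int), int a, int b) {
    closure c = { fn, { a, b } };
    cq[cq_len++] = c;
}

/* The join side, in the case where no closure was stolen: each
   closure still awaiting execution is instantiated in place. */
static int join_run_pending(void) {
    int sum = 0;
    while (cq_len > 0) {
        closure c = cq[--cq_len];
        sum += c.fn(c.args[0], c.args[1]);
    }
    return sum;
}

static int add_pair(int a, int b) { return a + b; }
```

Unlike a thread seed, the closure carries its arguments with it, so the sequential code after the fork can run in the parent while the closure waits to be stolen or instantiated.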

Discussion

MAM-DF eliminates some of the overhead of multithreading by allowing a parent thread to schedule its child thread directly. This reduces scheduling costs, by avoiding a general-purpose enqueue and dequeue, and reduces memory costs, by allowing arguments to be passed in registers. It also exposes the different policies allowed for representing work (threads that can be stolen): continuations, thread seeds, and closures.

Seed activation in MAM-DF creates an independent thread, leaving a parent's preexisting child still connected to the parent. We could have chosen to disconnect the preexisting child and create the new child with a dfork. This more closely parallels what must happen with continuation stealing, where the preexisting child is disconnected from the parent and the parent continues, which causes new children to be created as lazy threads. We have chosen to use fork in seed activation for reasons discussed in a later chapter.

MAM-DF still has scheduling inefficiencies. A child cannot return control and data simultaneously to its parent. Instead, as in MAM, a child must return the results through an inlet and then return control with dexit. Furthermore, if both children execute sequentially, i.e., they are never disconnected from their parent, we perform unnecessary synchronization. We now define a machine that allows us to eliminate this overhead.

MAM-DS: Supporting Direct Return

In this section we modify our multithreaded abstract machine to allow a thread to directly schedule its child on creation and its parent on return. This optimization allows control transfer between threads to more accurately mimic that of a sequential call and thus, when the semantics allow, to obtain similar efficiency. In particular, registers can be used to transfer both arguments and results.

We present new MAM-DS mechanisms which grant a lazy thread that runs to completion the ability to return results when it returns control to its parent.


dierence b etween MAMDS and MAMDF is that in MAMDS the thread exit op eration

lreturn can sp ecify return results which will b e transferred in registers directly to the

parent as the thread exits This change makes lazy thread exit similar to a sequential return

instruction The creation of lazy threads also changes In MAMDS lfork creates a lazy

thread and it diers from dfork in that an inlet is not sp ecied Instead the instruction

following the lfork is implied as the inlet just as in a sequential call

To understand the ramifications of these changes we need to analyze the actions carried out on return from a child thread. A return operation consists of three independent operations. First, it must return the results to its caller. Second, it must update the synchronization status in the parent, which may have multiple outstanding children. Third, it must ensure that the parent restarts at the proper continuation address. In MAM and MAM-DF the first two steps are carried out by invoking an inlet on the parent thread. The third step is achieved because the parent maintains an ip, which is the address at which it will restart when control is transferred to it, either from the general scheduler, from a work-stealing request, or directly from the child.

If we compare these three independent operations to the sequential return instruction, we see that a return instruction combines all three actions into a single operation. It returns control to the parent at its return address, passing results in registers. The instructions at the return address act as the inlet and save the results in the parent context. Then it continues to execute instructions in the parent. There is no synchronization step, since sequential calls always run to completion. The return address acts as both the inlet address and the continuation address.

Figure shows a translation of a function with two forks into pseudocode for MAM-DS using thread seeds. Lines 1-7a are executed when both children run to completion. The child created in line 3 returns to line 3a. No explicit synchronization is performed, and arguments and results are transferred directly in registers.

If the first child, X, suspends, control is transferred to line 30. This code fragment (lines 30-40) sets up the synchronization counter (line 31), changes X's inlet address (line 32), forks Y, and then handles Y's return (lines 33-35). The synchronization counter is set here because the order of return for the two children is no longer known. For the same reason, X's inlet is changed to inlet1, where it stores its result and performs synchronization on return. The reader should convince herself that whatever the order of X's and Y's return,


  Source:
    int function() {
      int a,b,c;
      a = fork X();
      b = fork Y();
      join;
      c = a+b;
      return c;
    }

  Codeblock code:
     1: function:
     3:   lfork slot0 = X()
     3a:  store a <- result
     4:   lfork slot1 = Y()
     4b:  store b <- result
     6: continue:
     7:   add c = a + b
     7a:  lreturn c

    Suspend stream:
    30: X_suspend:
    31:   set synch=2
    32:   set slot0 <- inlet1
    33:   lfork inlet2 = Y()
    34:   store b <- result
    35:   add synch = synch - 1
    36: join:
    37:   cmp synch,0
    38:   be continue
    39:   suspend
    40:   goto continue

    Steal stream:
    41: X_steal:
    42:   set synch=2
    43:   set slot0 <- inlet1
    44:   fork inlet2 = Y()
    45:   sreturn join

  Inlet code:
    10: inlet1:
    11:   recv a
    12:   add synch = synch - 1
    13:   cmp synch, 0
    14:   be enab
    15:   ireturn

    17: inlet2:
    18:   recv b
    19:   add synch = synch - 1
    20:   cmp synch, 0
    21:   be enab
    22:   ireturn

    24: enab:
    25:   enable
    26:   ireturn

  find(3a, suspend) = 30
  find(3a, steal)   = 38

Figure: Example translation of a function with two forks into pseudocode for MAM-DS using thread seeds. slot0 and slot1 are slots in the thread's activation frame used to hold inlet addresses. result is the register used to pass results from callees to callers.

Figure: Redefinition of a thread for MAM-DS, where i is either an inlet address or an index into the parent's inlet return vector. All elements not defined here remain the same as in MAM-DF.

the proper synchronization is performed, and eventually lines 6-7a will be executed, finishing the function.
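The synchronization-counter protocol the inlets implement can be modeled in a few lines of C. This is an illustrative sketch, not code from the dissertation: the names frame_t, inlet1, inlet2, and join_body are assumptions, with synch playing the role of the counter in the example.

```c
#include <assert.h>

/* Hypothetical model of the figure's synchronization counter: each
 * returning child runs its inlet, which stores the result and
 * decrements synch; the join is enabled only when synch reaches 0. */
typedef struct {
    int a, b, c;   /* results of the two children */
    int synch;     /* outstanding-children counter */
} frame_t;

/* inlet1: receives X's result; returns nonzero when the join is enabled. */
static int inlet1(frame_t *f, int result) {
    f->a = result;
    return --f->synch == 0;
}

/* inlet2: receives Y's result. */
static int inlet2(frame_t *f, int result) {
    f->b = result;
    return --f->synch == 0;
}

/* join body: runs only after both inlets have fired. */
static int join_body(frame_t *f) {
    return f->c = f->a + f->b;
}
```

Whichever child returns second enables the join, which is exactly why the counter must be initialized before the order of returns becomes unknown.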

If a steal request arrives while X is executing, the steal code fragment (lines 41-45) is executed. This code fragment activates the seed for Y as an independent thread, also setting up synchronization and changing X's inlet address.

From the previous example we see that in MAM-DS each potentially parallel call has two associated return paths: one for the case when the child runs to completion, and the other for the case when the child has been disconnected from the parent. If the child runs to completion, it can use a mechanism similar to a sequential return, i.e., it returns to the instruction following the fork. If disconnected, the inlet approach of MAM is used.


To support this dual-return strategy in MAM-DS we introduce indirect inlets. An indirect inlet is simply an inlet invoked through an indirect jump. The key behind this change is that the indirect inlet address can be changed by the parent to reflect the parent's current state. We redefine the return register in a thread to be a pointer to an indirect inlet address (see Section ). We also redefine the find mapping: find(adr, type) maps the ip at adr to a code pointer determined by type, which is one of suspend or steal.

Thread seeds in MAM-DS have three code pointers: the main code pointer, the suspend code pointer, and the steal code pointer. When thread seeds are used to represent the nascent threads, find maps the thread's ip (the main code pointer of the thread seed) to one of two code pointers: find(ip, suspend) returns the suspend code pointer, and find(ip, steal) returns the steal code pointer.

When continuations are used to represent the remaining work in a thread, find maps the ip in a thread into the continuation that will be stolen. In this case, find ignores the second argument.
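The three code pointers of a thread seed and the find mapping can be sketched in C as a small lookup table. Everything here — the struct and function names and the table-based lookup — is an illustrative assumption; a real compiler would emit this mapping statically next to the generated code rather than search a table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a thread seed and the find() mapping. */
typedef void (*code_ptr)(void);

typedef enum { FIND_SUSPEND, FIND_STEAL } find_type;

typedef struct {
    code_ptr main_cp;     /* address following the lfork (the seed's ip) */
    code_ptr suspend_cp;  /* code run when the current child suspends    */
    code_ptr steal_cp;    /* code run on a work-stealing request         */
} thread_seed;

/* find(ip, type): map the seed's ip to its suspend or steal code pointer. */
static code_ptr find(thread_seed *seeds, size_t n,
                     code_ptr ip, find_type type) {
    for (size_t i = 0; i < n; i++)
        if (seeds[i].main_cp == ip)
            return type == FIND_SUSPEND ? seeds[i].suspend_cp
                                        : seeds[i].steal_cp;
    return NULL;  /* no nascent thread at this ip */
}

/* Stand-ins for compiler-generated code fragments. */
static void after_lfork(void) {}
static void on_suspend(void) {}
static void on_steal(void)   {}
```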

MAM-DS Operations

MAM-DS replaces dfork with lfork and dexit with lreturn. lfork, like dfork, creates a lazy thread and transfers control directly to the new thread. lreturn simultaneously returns control and data from the child to its parent when possible.

Indirect Inlets

The main challenge in implementing MAM-DS is to support, with a single instruction, two different return paths: one for return when the child is connected to the parent, and one for return when the parent and child have been disconnected. Instead of passing to the child the address of the inlet the child should invoke upon return, lfork passes the child the address of a location in the parent's frame which contains the inlet address. We call this location the indirect inlet address. Thus inlet invocation is performed via an

By using a mapping function to derive the continuation to be stolen from the ip, we blur the distinction between continuation stealing and seed activation. In previous work the actual continuation to be stolen is posted on a queue, so that two different continuations need to be saved: one used by the child for return, and one used by the system for suspension and work stealing. The mapping function eliminates the need for two continuations and allows us to focus on the more important difference between continuation stealing and seed activation, viz., how they treat the connection between the parent and child.


    lfork i = routine(args)      ; side effect of lfork:
                                 ;   indirect inlet address <- rtc_inlet
  rtc_inlet:
    store results
  continue:
    more code ...

  suspend_label:                 ; side effect of suspend:
    setup synchronization        ;   indirect inlet address <- inlet_after_suspend
    goto continue

  inlet_after_suspend:
    store results
    perform synchronization
    ireturn

  convertReturn(rtc_inlet) = inlet_after_suspend
  find(rtc_inlet, suspend) = suspend_label

Figure: A pseudocode sequence for an lfork with its two associated return paths. If the child runs to completion, it returns to rtc_inlet. Otherwise it runs the suspend routine and later returns to inlet_after_suspend. This example uses continuation stealing.

Inlet invocation is thus performed via an indirect jump through the indirect inlet address. The key behind this change is that the indirect inlet address can be changed by the parent to reflect the parent's current state.

When the child is invoked, its indirect inlet address is set to the address of an inlet that behaves like the code after a sequential call (see rtc_inlet in Figure ). If the child runs to completion, no synchronization is performed, and the rtc inlet runs when the child returns control to the parent, storing the results and continuing. If the parent is resumed before the child returns (e.g., the child suspends), then the indirect inlet address is changed by the routine at suspend_label to contain a new inlet. The new inlet (inlet_after_suspend in Figure ) stores the results and then performs the necessary synchronization. The inlet used after a child suspends must end with an ireturn. This ensures that control is transferred to the proper place when the inlet finishes.
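The two return paths through a single indirect call can be sketched in C by modeling the indirect inlet address as a function-pointer slot in the parent's frame. All names here are illustrative assumptions, not the dissertation's implementation; the point is only that the child always returns the same way, while the parent swaps what that return means.

```c
#include <assert.h>

/* Hypothetical sketch of an indirect inlet: the parent's frame holds
 * one slot -- the indirect inlet address -- and the child always
 * returns through it with an indirect call. */
typedef struct parent_frame parent_frame;
typedef void (*inlet_fn)(parent_frame *, int result);

struct parent_frame {
    inlet_fn inlet;    /* the indirect inlet address             */
    int result;
    int synch;         /* used only on the suspended path        */
    int synchronized;  /* set when the post-suspend inlet fires  */
};

/* rtc_inlet: run-to-completion path; behaves like the code after a
 * sequential call -- store the result, no synchronization. */
static void rtc_inlet(parent_frame *p, int result) {
    p->result = result;
}

/* inlet_after_suspend: installed once the child has suspended;
 * stores the result and performs the synchronization step. */
static void inlet_after_suspend(parent_frame *p, int result) {
    p->result = result;
    if (--p->synch == 0)
        p->synchronized = 1;
}

/* suspend's side effect: redirect the child's future return. */
static void on_suspend(parent_frame *p) {
    p->synch = 1;
    p->inlet = inlet_after_suspend;
}

/* lreturn: the child returns through whatever inlet is installed. */
static void lreturn(parent_frame *p, int result) {
    p->inlet(p, result);
}
```

The child's lreturn is identical in both cases; only the contents of the slot differ, which is what lets a single instruction serve both return paths.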

Changes to Support Indirect Inlets

To support this new return mechanism, many of the operations in MAM-DF are changed to manipulate the indirect inlet addresses: lfork passes its child an indirect inlet address, send can take an indirect inlet address, lreturn uses the indirect inlet address for return, suspend changes the indirect inlet stored in the indirect inlet address, and a work-stealing request also changes the indirect inlet. lfork (see Figure ) sets the indirect

There is a unique indirect inlet address for each outstanding child. The number of indirect inlet addresses in a thread's frame is the maximum, over all joins in the thread, of the number of children that the join is synchronizing.


Figure: The lfork operation in MAM-DS.

Figure: The lreturn operation in MAM-DS when the parent and child are connected and the parent has not been resumed since it lforked the child.

inlet address to point at a code fragment immediately following the lfork. This is where a thread that runs to completion will return. send can either transfer data to an inlet specified by an address or to an inlet specified by an indirect inlet address.

The actions undertaken by lreturn vary according to the state of the child and the parent. If the child runs to completion and its parent has never had work stolen from it, lreturn mimics a sequential return (see Figure ). It reclaims the resources of the thread executing the lreturn and returns results in registers, and control, to its parent by


Figure: The suspend operation in MAM-DS.

invoking the inlet at its indirect inlet address. Thus, when lreturn returns from a child that executed to completion, the indirect inlet address acts like the return address in a sequential call. We discuss the other two cases for lreturn below.

suspend (see Figure ) resumes the parent at the continuation indicated by find(ip, suspend). The reason we cannot continue at the parent's ip, as we did in MAM-DF, is that the code immediately following the lfork is the return inlet, which cannot be executed until the child thread actually exits. In addition to resuming the parent, suspend also changes the indirect inlet address for the parent's outstanding lazy child to an inlet that receives results for children that have suspended (see Figure ). The work-stealing operations, like suspend, also change the indirect inlet address for the parent's outstanding lazy child.

Resuming the Parent

When continuation stealing is used to resume the parent, the parent and child are disconnected, and the parent continues at the point in the codeblock that follows the lfork's associated inlet (see Figure ). When the child finally returns to the parent, it cannot resume the parent, since the parent was already resumed when its continuation was stolen. Instead, lreturn runs the inlet in the indirect inlet address and frees the thread's resources, including the processor. Since the inlet ends with ireturn, we can ensure that


Figure: The lreturn operation when the parent and child have been disconnected by a continuation-stealing operation.

the processor is idled by making ireturn go to the general scheduler, represented by the address wait in Figure .

If seed activation is used to get work from the parent, then the routine at the suspend or steal code pointer will activate the nascent thread that the seed represents. After the nascent thread has been activated, the parent's ip will point to the next thread seed, not to the inlet immediately following the lfork that activated the thread. The reason that the parent's ip does not point to the activated thread's inlet is that the parent can next be resumed either by one of its outstanding children or by a work-stealing request. We can see that once a parent has activated a seed, all of its children must be started with inlets that store their results, perform synchronization, and then continue in the parent at its current ip. Thus we lose some of the benefit of lreturn for threads that were activated from seeds: though we pass the results from the child to the parent in registers, they may immediately be stored in the parent frame.

The behavior of lreturn for a thread whose parent has activated a seed ensures that when the inlet executes ireturn, the parent will continue at the proper point, which is the parent's current ip. It achieves this by placing the parent thread and ip on the parent thread's stack before executing the inlet. When the inlet executes ireturn, the parent will continue at the proper point (see Figure ).

The rules in Figure , Figure , and Figure cause a processor with an ip of wait to get work from the ready queue, from the parent queue by continuation stealing, and from the parent queue by seed activation, respectively.


Figure: The lreturn operation in MAM-DS when a thread seed in the parent has been activated.

Discussion

MAM-DS allows a potentially parallel call that runs to completion to take advantage of the sequential call and return mechanisms without limiting parallelism. It transfers control and data simultaneously on both invocation and termination.

This machine also clarifies the distinction between continuation stealing and seed activation. When continuation stealing is used to continue the parent before its child returns, the connection between the parent and the child is broken. This causes the child to use the two-step return process of an independent thread: lreturn behaves like a send followed by an exit.

When seed activation is used to continue the parent, it becomes a slave to its children. In other words, each of its returning children causes it to activate a thread seed until no more remain. When no more thread seeds remain, subsequent returning children fail at the join until the last child returns. When the last child returns, it continues the parent after the join.


The Lazy Multithreaded Abstract Machine (LMAM)

With the lazy multithreaded abstract machine (LMAM) we complete our goal of a multithreaded abstract machine that can execute a potentially parallel call with nearly the efficiency of a sequential call. In LMAM, an lfork starts a new lazy thread on the same stack as its parent thread. Combined with the direct scheduling of threads on call and return introduced in MAM-DS, this allows us to transfer arguments to the child in registers, allocate the child's frame on the stack of the parent, and, if the thread runs to completion, return the results in registers. In this section we introduce the two disconnection methods and the four storage models.

If a thread needs to run concurrently with its parent, because the child executes a suspend or the parent is stolen, then an independent thread is created for the child. This involves disconnecting the control between the parent and child and disconnecting the stack (i.e., creating a new stack) so the child or parent can continue. The disconnection operation depends on the underlying storage model used to implement the cactus stack.

There are four storage models that we consider: linked frames, multiple stacks, stacklets, and spaghetti stacks. We divide the storage models into two broad classes: linked storage models and stack storage models. In the linked storage models (linked frames and spaghetti stacks), links between frames are explicit: every thread frame has an explicit link to its parent. In these models disconnection is easy because, in some sense, the threads already have their own stacks. In the stack storage models (multiple stacks and stacklets), the links between the frames can be implicit or explicit. The implicit links are between lazy threads and their parents; they are implicit since the parent of a lazy thread can be found by doing arithmetic on a stack pointer. The explicit links are between the stacks of the independent threads.

By fixing the storage model for threads, LMAM also dictates how the indirect inlet address is implemented. The indirect inlet address is stored in the same slot of the parent's activation frame used for a sequential call's return address. The link between a child thread and its parent is the pointer from the child to the indirect inlet address in the parent that the child will use to return to the parent. In other words, the return register is the link from a child to a parent. If this link is implicit, then the return register is never stored explicitly.


Figure: The lfork operation in LMAM with multiple stacks.

We thus have two operational semantics for LMAM. The first describes LMAM implemented with linked frames or spaghetti stacks; since the cactus stack is maintained with links, this semantics mirrors that of MAM-DS. The second describes LMAM implemented with multiple stacks or stacklets.

Disconnection and the Storage Model

When a lazy thread is elevated to an independent thread, it must be disconnected from its parent. There are two disconnection methods: eager and lazy. Eager-disconnect allows the parent to invoke children on its abstract stack in exactly the same manner as it did before it had a disconnected child. Lazy-disconnect leaves the current abstract stack untouched and forces the parent to allocate new children differently than it did before it had a disconnected child.

In order to understand the difference between lazy and eager-disconnect, consider two of the actions taken by an lfork in LMAM (see Figure ). First, it saves the indirect inlet address in the child's return register. This acts as the link between the child thread and its parent. Second, it creates an activation frame for the child. Because the lfork operation always stores the indirect inlet address in the same slot of the parent, we will have to copy the disconnected child's indirect inlet address to a new slot. This allows the parent to invoke its children in the same manner, independent of whether it has a disconnected


Figure: Eager-disconnect used to disconnect a child from its parent when a linked storage model is used. In this case only the indirect inlet address needs to be copied.

Figure: An example of eager-disconnect when the child is copied to a new stack.

child. Thus there must be a separate slot in the activation frame for each dfork in a fork-set. The child's return register also has to be updated. We call this method eager-disconnect (see Figure ). Of course, in the stack storage models, eager-disconnect also has to copy the child's activation frame onto another stack.

If we leave the child's indirect inlet address in its original place, then any future child invoked by the parent has to link to another slot in the parent's frame for its indirect inlet address. Lazy-disconnect does not copy the indirect inlet address. It also does not copy the child's activation frame, even in the stack storage models.

Eager-disconnect

For all storage models, eager-disconnect copies the child's indirect inlet address to another slot in the parent's frame. In addition, in the stack models it must split and


Figure: An example of eager-disconnect when the parent portion of the stack is copied to a new stack.

Figure: The suspend operation under LMAM-MS and LMAM-S, using child copy for eager-disconnect.

copy the stack. Eager-disconnect by child or parent copying causes the stack to be split and a portion of it to be copied to a newly allocated stack at the time of disconnection. Child-copying copies the portion of the stack used by the child to a newly allocated stack (see Figure ). Parent-copying copies all but the child's frame to a newly allocated stack (see Figure ). In either case, since we copy the local data to a new location, we must either forbid pointers to data in a frame, or scan memory and update any pointers to the region being copied. Eager-disconnect is currently employed by most lazy thread systems.

We can also copy only some of the parent's ancestors and leave behind frames which forward data to the new location.


The suspend operation in Figure is an example of child-copying that splits the stack. In addition to changing the state of the child and parent, updating the indirect inlet address, and scheduling the parent, it allocates a new stack and copies the data for the child thread to the new stack.
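Child-copying can be sketched in C as a split of the shared stack. This is a simplified, hypothetical model (an upward-growing byte stack with illustrative names); as noted above, a real implementation must also fix up any pointers into the moved frame, which the sketch omits.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of child-copying eager-disconnect: the bytes
 * belonging to the child's frame are copied from the shared stack
 * onto a freshly allocated stack, and the shared stack's top is
 * rolled back so the parent can keep growing in place. */
typedef struct {
    char *base;   /* lowest address of the stack's storage  */
    char *top;    /* first unused byte (stack grows upward) */
} astack;

static astack astack_new(size_t bytes) {
    astack s;
    s.base = malloc(bytes);
    s.top = s.base;
    return s;
}

/* Split off the child's frame (the topmost child_bytes) to a new stack. */
static astack disconnect_child(astack *shared, size_t child_bytes) {
    astack child = astack_new(child_bytes);
    shared->top -= child_bytes;               /* child's frame starts here */
    memcpy(child.base, shared->top, child_bytes);
    child.top = child.base + child_bytes;
    return child;                             /* parent keeps `shared` */
}
```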

The advantage of eager-disconnect is that the price is paid only once, at the time of the disconnect operation. The disadvantage is that in the stack storage models this one-time cost can be quite high and, even more importantly, pointers to data in a frame cannot be passed to other threads.

Lazy-disconnect

Lazy-disconnect, a new method introduced in this thesis, logically disconnects the parent and child without requiring any copying at the time of disconnection. It allows us to use pointers to data in a frame without requiring that we scan memory to update them at disconnect time. Lazy-disconnect is the analog of lazy invocation. Just as a lazy thread call does not eagerly create a new thread, lazy-disconnect does not eagerly separate a child and parent, instead causing the parent to create a new stack for its future children in the same fork-set.

The lazy method does not perform any stack operations in any of the storage models when disconnection occurs, but instead incurs the cost of allocating a new stack for every fork or call made by the parent after it has been disconnected from its child. In other words, lazy-disconnect causes the memory model to devolve into the linked-frame model after a parent has been disconnected from its child. When the logical tree of stacks is cactus-like, having the parent create new stacks for each future child is not onerous, as the children are able to invoke their children using the more efficient stack- or stacklet-based calls.

In short, lazy-disconnect performs disconnection without copying either the child or the parent. Instead, the child steals the stack, causing all future allocation requests in the parent to be fulfilled by allocating a new stack (see Figure ). To support this method of disconnection we change the definition of the stack to be a pair consisting of top and the former stack, ⟨top, stack⟩; top points to the highest location used in the stack itself.


Figure: An example of how a parent and its child are disconnected by linking new frames to the parent.

Figure: The suspend operation for LMAM-MS and LMAM-S using lazy-disconnect.

When top equals sp, lfork works as it does in LMAM-DS. After disconnection, top does not equal sp (see Figure ). In this case lfork must allocate a new stack for the child thread (see Figure ).
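The ⟨top, stack⟩ pair and the resulting allocation rule can be sketched in C. The sketch is hypothetical (names and sizes are illustrative): while top still equals the caller's sp, a child frame is a cheap bump allocation on the same stack; once they differ, the stack has been stolen and the parent's next child gets a fresh stack.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of lazy-disconnect's allocation rule.  A stack
 * is the pair <top, data>; top records the highest location in use. */
typedef struct lstack {
    char  *data;
    size_t top;          /* highest location used in the stack */
} lstack;

static lstack *lstack_new(size_t bytes) {
    lstack *s = malloc(sizeof *s);
    s->data = malloc(bytes);
    s->top = 0;
    return s;
}

/* Allocate a child frame of `bytes`; *stk may be replaced by a
 * freshly allocated stack if the current one has been stolen.
 * (The fixed 4096-byte size of the fresh stack is illustrative.) */
static char *alloc_child(lstack **stk, size_t sp, size_t bytes) {
    if ((*stk)->top == sp) {          /* still connected: extend in place */
        char *frame = (*stk)->data + sp;
        (*stk)->top = sp + bytes;
        return frame;
    }
    /* disconnected: future children go on a fresh stack */
    lstack *fresh = lstack_new(4096);
    fresh->top = bytes;
    *stk = fresh;
    return fresh->data;
}
```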

Summary

This chapter culminates with the description of a lazy multithreading abstract machine which eliminates all of the unnecessary overhead found in MAM when potentially parallel calls run to completion.

There are no enqueue, dequeue, or scheduler operations when lfork is used to create a lazy thread.


Figure: The lfork operation under LMAM-MS using lazy-disconnect, after a thread has been disconnected from a child.

- Processor registers are used to transfer arguments to the child and results back to the parent.

- Unnecessary register saves and restores are eliminated, since the thread is never scheduled by the general scheduler except when it has previously suspended.

- Lazy threads do not require their own stack.

- No synchronization operations are performed between the parent and its children when the children are scheduled sequentially.

Chapter

Storage Models

This chapter describes the different concrete implementations for the storage portion of the abstract machine presented in the last chapter. We investigate four storage models for a global cactus stack: linked frames, multiple stacks, spaghetti stacks, and stacklets. Although the storage model and the control model interact, we delay a detailed discussion of this interaction until the next chapter. We explain how each model maintains the logical global cactus stack, how it allocates, deallocates, and disconnects frames, and what tradeoffs it introduces. There are four kinds of calls (sequential, lazy fork, fork, and remote fork), each of which maps onto a different kind of allocation request in each storage model.

For each storage model we discuss four tradeoffs: allocation efficiency, internal fragmentation, external fragmentation, and ease of implementation. Of these, the most important is the cost of allocating and deallocating an activation frame for the four different types of calls. Internal fragmentation, caused by leaving unused space within the basic allocation unit, is also important. Less important is external fragmentation, which results from allocating activation frames in different regions of the memory space. Finally, implementation details make certain storage-model variations impractical on some machines.

The models fall into two broad categories: linked storage models and stack storage models. The linked storage models include the linked frames and spaghetti stack models, which represent the parent-child relationship between lazy threads with explicit links. The stack storage models include the multiple stacks and stacklets models, which represent the parent-child relationship between lazy threads implicitly. Another view is that these four models span a spectrum of possible implementations, with linked frames on one end


and multiple stacks on the other. In between, spaghetti stacks and stacklets offer different compromises, with spaghetti stacks being closer to linked frames and stacklets closer to multiple stacks.

Storing a Thread's Internal State

In this section we discuss storage requirements for a thread's state variables: its ip, indirect inlet addresses for its disconnected children, and its portion of the ready queue.

In the previous chapter we made an artificial distinction between the thread's stack and its state (i.e., its ip, sp, parent pointer, and return register). In fact, when a thread is not running, it stores the thread state in its stack. A lazy thread also has state, although it may not have its own stack. It does, however, always have a base frame on some stack. It is in this base frame that we store the lazy thread's state.

Since all threads, lazy or independent, have a unique base frame, we use the address of the base frame to identify a thread. We sometimes find it convenient simply to refer to a thread by its frame.

In all of the memory models presented, each activation frame includes a field to store the thread's ip. Recall that a thread's ip is the address of the instruction at which it will begin operation when it is next scheduled, either by the scheduler or by an outstanding child. The return register is made up of the parent pointer and the indirect inlet address in the parent. For lazy threads and sequential calls, this is the return address of the child, which is the same as the inlet address.

As discussed in Section , there must be a sufficient number of slots in the activation frame to hold the indirect inlet addresses of children that may be disconnected. In eager-disconnect, the slots are used to store the indirect inlet addresses for disconnected children. In lazy-disconnect, the slots are used to store the indirect inlet addresses of children invoked after disconnection.

In the concrete implementation of LMAM we distribute the ready queue among the independent threads. Since the ready queue can include each independent thread at most once, it can be implemented by allocating a fixed number of frame slots per independent thread. We implement the ready queue as p linked lists, where p is the number of processors in the system. Thus we allocate one slot per thread activation frame and one ready-queue

Section discusses the rationale for splitting the return register between the parent and the child.


Figure: A cactus stack using linked frames. Each frame has an explicit link to its parent.

pointer per processor. Since lazy threads can become independent, we also reserve a slot in each lazy thread activation frame.

To summarize every mo del requires that an ip an indirect inlet address and a

slot for the ready queue b e stored in each activation frame For reasons given in Chapter

the return register is not directly realized The parent p ointer is included in the activation

frame only in the linked frames and spaghetti stacks mo dels The sp is not stored in the

threads state in concrete implementations of any of the mo dels

Remote Fork

Remote fork, used only on distributed-memory machines, is handled by sending a message to the remote processor, which performs the allocation request and deposits the arguments into the newly allocated frame. Since local fork has exactly the same storage requirements as remote fork, we discuss only local fork in this chapter.

Linked Frames

We first discuss mapping the cactus stack onto linked frames. This approach is the simplest and, as we shall see, the least attractive. It is also the most commonly used, e.g., in Cilk, Id, and Mul-T.

The cactus stack is represented explicitly by the links between the frames, as shown in the figure above. Each activation frame, whether it is the start of a new thread or a sequential call, is allocated in a separate heap-allocated frame. The link between a child and its parent is explicitly part of the activation frame.


The linked-frames mapping imposes an expensive allocation and deallocation operation on every call and return. It produces no internal fragmentation, but it can produce external fragmentation, which can have negative effects on the processor cache and TLB, since frames related by control may be in completely different areas of the memory system.

Operations on Linked Frames

When using linked frames, all of the allocation requests (fork, lfork, and call) are treated uniformly. We allocate a child frame and link it to its parent frame. exit, lreturn, and return free the frame and return it to a pool of unused frames, which may be reused for subsequent allocation requests. Since each activation frame is physically separate from every other, disconnecting a lazy thread from its parent never requires copying.

The linked-frame mapping is simple to implement. However, it does nothing to optimize for the most frequent operations: the sequential call and the potentially parallel call that completes without suspension.
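The pool-based allocate and free paths described above can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation; the field and function names are assumed:

```c
#include <stdlib.h>

/* Sketch of a linked-frames allocator: freed frames go on a free list
   and are reused before calling malloc again. */
typedef struct frame {
    struct frame *parent;  /* explicit link forming the cactus stack */
    struct frame *next;    /* free-list link, valid only when freed  */
} frame_t;

static frame_t *free_pool = NULL;

static frame_t *alloc_frame(frame_t *parent) {
    frame_t *f;
    if (free_pool) {               /* fast path: reuse a freed frame */
        f = free_pool;
        free_pool = f->next;
    } else {
        f = malloc(sizeof *f);     /* slow path: go to the allocator */
    }
    f->parent = parent;
    return f;
}

static void free_frame(frame_t *f) {
    f->next = free_pool;           /* push back onto the pool */
    free_pool = f;
}
```

Because freed frames are pushed onto the pool, a free immediately followed by an allocation reuses the same memory, which is what makes the assumption that pool hits dominate plausible for call-intensive programs.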

Linked Frame Timings

Here we discuss the cost of the basic operations. We present instruction counts whenever possible and cycle counts for library operations. We divide the instructions into three categories: memory reference instructions, branch instructions, and register-only instructions. We contend that on modern processors the most important metric is memory reference instructions; thus we try to minimize the number of memory reference instructions. In the cases where we depend on a library call to do some of the work, we measure the cycle count to allow application to other processors. In all cases we are concerned only with the costs of storage management; control costs are covered in the next chapter. In particular, we do not include the costs of manipulating the indirect inlet address on disconnection, since it depends on the queuing method chosen.

The timings are presented in the table below. The first two lines of the table contain the cost of performing a sequential call and return in C. This is used as a comparison against which we can judge the effectiveness of the linked-frame model. The next two lines show the cost of performing a sequential call in the linked-frame model. We then show the storage costs of allocating, scheduling, and freeing a thread. The allocation cost does not include

(Footnote: In all cases we give the cost of a call with no arguments and a return with no results.)


Table: Times for the primitive operations using linked frames, broken into memory, branch, and register-only instruction counts for each of the following operations: C call, C return, sequential call, sequential return, create thread from pool, schedule thread, free thread, create lazy thread, lazy return, eager-disconnect, and create lazy thread after lazy-disconnect. These times do not include any control costs. They assume frames are allocated from a pool of frames and do not include the cost of a malloc.

the cost of performing a malloc to allocate the thread. Instead, our allocation policy is to put freed frames in a pool of frames which can be reused. We assume that requests satisfied from the pool dominate those that are not, and thus do not include the cost of doing a malloc. Finally, we show the storage costs for lazy threads.

As we see from the table, the linked-frame model imposes a significant penalty on sequential calls over that of a traditional stack model. Furthermore, sequential calls have no advantage over forks. While lfork has no special advantage over fork in terms of storage, it also has no extra cost in the case when the child is disconnected from the parent.

Like call, return is also significantly more expensive than in the traditional stack model. lreturn is more expensive than exit because it includes the cost of loading the frame pointer for the parent frame in addition to freeing the frame.

The data bears out the reasoning that the linked-frame model is not very efficient for sequential and potentially parallel calls. In addition, it creates external fragmentation. In spite of these problems it is often used because it is easy to implement.

(Footnote: This operation could be made more efficient by using a heap pointer, as suggested by Appel. However, Appel's method requires garbage collection or a reclamation method that makes the resulting cactus stack look very similar to our spaghetti stacks.)


Figure: A cactus stack using multiple stacks. There are only explicit links from the stubs at the base of a stack to the parent of the thread in the base frame of the stack.

Multiple Stacks

The next approach we discuss is that of mapping the cactus stack onto multiple stacks. Each thread is assigned its own contiguous stack. The cactus stack is maintained with explicit links between stacks and implicit links within each stack (see the figure above). The link between a lazy thread and its parent is implicit because a lazy thread is allocated on the same stack as its parent. The parent pointer is not stored, but is calculated by doing arithmetic on the stack pointer.

The main advantage of this approach is that an allocation request for call and lfork has the same cost as a sequential call in a single-threaded system. The disadvantages are that it limits the number of threads in the system (due to the large size of each stack), creates internal fragmentation, and increases the cost of creating an independent thread.

Multiple stacks have never, to our knowledge, been used in a lazy multithreaded system, but variations of the multiple-stacks approach have been used in many thread packages, including pthreads, NewThreads, and Solaris Threads, and in some languages, e.g., Cid. All of these systems fix the size of the stack, and none provides protection against stack overflows.

Stack Layout and Stubs

Each stack in the multiple-stacks approach is divided into two regions: the stub and the frame area (see the figure below). The stub contains thread-specific data and maintains the global cactus stack by linking the individual stacks to each other. The frame area contains the activation frames of children invoked by either sequential calls or lforks. Thus the


Figure: Layout of a stack with its stub. From top to bottom: free space, the frame area, and the stub area, which holds the stub handler address, a slot for saving the sp, the bottom frame's parent pointer, and the bottom frame's return address.

cactus stack is maintained both explicitly, with links from the stack stubs to the parent thread of the bottom frame in the frame area, and implicitly, between lazy threads in the frame area.

The stack stub serves to maintain the explicit links needed between threads. It also stores an ip that can be accessed by the thread above it on the stack. When the thread returns, instead of testing its own state, it does an lreturn that assumes that the thread is lazy and is still connected to its parent. lreturn invokes the stub routine pointed to by the stub's ip. This routine, using the data in the stub's frame, then performs the necessary operations to return to the thread's parent.

The stack stub maintains information common to all the activation frames on its stack. When none of the threads on the stack is running, the stub contains the sp for the stack. The sp stored in the stub is loaded into the processor's sp when any of the threads in the stack is scheduled by the general scheduler. Every activation frame can calculate the stub for the stack on which it lives. The stub also contains the parent pointer for the bottom frame in the stack. Since all other threads on the stack are descendants of the bottom frame, this is the only explicit link needed.
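If stacks are aligned, power-of-two-sized regions (as a later footnote suggests), any frame can locate its stub by masking off the low-order bits of an address within the stack. The sketch below illustrates this under those alignment assumptions; the constant and function name are illustrative:

```c
#include <stdint.h>

#define STACK_SIZE 4096  /* stack size, assumed to be a power of two */

/* Given any address inside a stack, find the stub at the base of that
   stack by clearing the low-order bits of the address. */
static inline uintptr_t stub_of(uintptr_t addr) {
    return addr & ~(uintptr_t)(STACK_SIZE - 1);
}
```

The mask compiles to one or two register-only instructions, which is why the parent pointer need not be stored in every frame.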

Stubs allow every thread, lazy or independent, to use the sequential return mechanism of lreturn even though a cactus stack is being used. The stack stub stores all the data needed for the bottom frame to return to its parent. When a new stack is allocated, the parent's return address and stack pointer are saved in the stub, and a return address to the stub handler is given to the child. When the bottom frame in a stack executes a return, it does not return to its caller. Instead, it returns to the stub handler. The stub handler

(Footnote: This can easily be accomplished by fixing the size of a stack to be a power of two and using an exclusive-or and a logical-and instruction.)


performs stack deallocation and, using the data in the stack stub, carries out the necessary actions to return control to the parent, e.g., setting the sp to the parent's stack.
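The data a stub must carry for this return path can be summarized in a small record. The layout below is a sketch with assumed field names, not the actual structure used in the implementation:

```c
#include <assert.h>

/* Sketch of a stack-stub record. The bottom frame's lreturn jumps to
   handler_ip instead of testing its own state; the handler then uses
   the remaining fields to restore the parent's context. */
typedef struct stub {
    void *handler_ip;     /* address of the stub routine to enter       */
    char *saved_sp;       /* sp for this stack when it is not running   */
    void *parent_frame;   /* parent pointer for the bottom frame        */
    void *parent_retadr;  /* the bottom frame's saved return address    */
} stub_t;
```

These four fields correspond directly to the slots shown in the stack-layout figure above.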

There are two different stub handlers in the multiple-stacks model: the local-fork stub handler, which performs the actions described in the previous paragraph, and the remote-fork stub handler. The remote-fork stub handler uses indirect active messages to return data and control to the parent's message handler, which in turn is responsible for integrating the data into the parent frame and indicating to the parent that its child has returned.

Stack Implementation

Since we do not have infinite memory in which to allocate stacks, we must decide how to simulate a virtually infinite stack. Many thread packages do this by allocating a relatively small (on the order of kilobytes) fixed stack for each thread. This works well for some programs on some data sets, but not at all for those that use a great deal of automatic data or have deeply nested routines. There are, however, many programs which, because the stack size depends upon the input data set, require large stacks.

Problems occur because these thread packages have no stack-overflow detection. Stack overflow is especially problematic in multithreaded systems. If stacks are adjacent, a thread that overflows will write to the base of another thread's stack, which causes errors to occur when the other thread is scheduled and finally returns to the frame at the base of its stack. Even if the stacks are not adjacent, the error caused by a stack overflow may not surface until later in the program's execution.

Stack-overflow protection can be added in one of three ways. First, sequential calls can check for stack overflow, which makes this model exactly like stacklets (described later in this chapter). Second, a guard page can be allocated for each stack. Third, a guard page can be created dynamically whenever a thread is scheduled. The latter two methods require that stacks be allocated on page boundaries and be multiples of the system page size. If a static guard page is allocated, then a page is wasted for each stack. With the dynamic method, the cost of switching to a new thread is raised substantially.

(Footnote: Guard pages require that the operating system allow the user to change the write protection of a page in memory, which is not supported on some machines. It is probably for this reason that thread packages do not support overflow detection.)


Figure: Allocating a sequential call or an lfork on a stack.

If there is already a copying garbage collector in the system, we can use it to avoid allocating large fixed regions. Instead, we can allocate a small stack and set a guard. When the stack overflows, we reallocate it to a larger region, copying the data and adjusting any pointers to the data in the stack as appropriate. Of course, pointer adjustment would have to happen across the entire address space.

Without garbage collection, we have three different ways to implement multiple stacks. The fixed scheme allocates a small fixed stack per thread. The fixed scheme with static guard pages associates a static guard page with each stack. The fixed scheme with dynamic guard pages sets the guard page on each stack when the thread associated with it is running. The first scheme can be implemented using standard memory allocation primitives like malloc. The latter schemes require the ability to manipulate the read/write permissions of virtual pages.

Operations on Multiple Stacks

In this section we discuss the implementation of the operations using multiple stacks. The multiple-stacks model exploits the fact that each thread has its own stack. Sequential call and lazy fork both allocate the child frame on the stack of the parent. As a result, when a lazy thread needs to become independent, we must perform some kind of operation to disconnect the parent and child threads.

Allocation

The allocation operation for call or lfork allocates the child frame on the same stack as the caller. The operation is carried out by the child frame by adjusting the sp (see the figure above).


Figure: Allocating a new stack when a fork is performed.

The fork operation requires that a new stack be allocated and linked to the caller's stack (see the figure above). The actual allocation requests will differ depending upon which stack scheme is used, but they are similar in all cases. If there is a free stack which has already been allocated, it is reused. Otherwise a new stack area is allocated from the system using the appropriate library call (i.e., malloc or mmap), and if a static guard page is being used, it is set.

Deallocation

When a lazy thread or sequential call exits or returns, it deallocates its frame area by adjusting the sp. When a thread exits, it returns its stack to the free pool of stacks.

Scheduling

Unlike the linked-frame representation, the multiple-stacks model requires work to be performed whenever a thread is scheduled. In particular, the dynamic guard page scheme requires that the page above the stack be set so that if an overflow occurs it will cause a fault. When a thread suspends, exits, or yields, the guard page is unprotected. In addition, in all the schemes the sp has to be set from the stub data.

Disconnection

Since a lazy thread is allocated on the same stack as its parent, if the lazy thread suspends it must be converted into an independent thread. This process requires that the parent and child each have their own logical stack. We now discuss the two options for disconnection, lazy and eager, identified earlier.




The lazy method avoids copying and allows pointers to automatic variables by forcing all subsequent allocations by the parent onto new stacks. This means that subsequent sequential calls and lazy forks will always allocate a new stack for their frames. Since a suspending child is left unchanged, the ip that a parent presents to its future children must reside in a separate slot in the frame.

Eager-disconnect is significantly harder to implement than lazy-disconnect. When we copy an activation frame, we must be careful not to invalidate any pointers to the copied frames. Even if we disallow pointers to automatic variables within a frame, we must correctly maintain the pointers that comprise the cactus stack and the ready queue. Although we can update parent pointers that are on the local processor relatively inexpensively, updating remote parent pointers is costly and can lead to race conditions which are difficult and expensive to alleviate. To ensure that frames with remote children are not copied, we force any frame with a remote child to live at the bottom of a stack, creating a new stack if necessary, and enforce the invariant that the bottom frame in a stack is never copied.

This invariant implies that when a lazy thread suspends or yields, it must be disconnected from its parent by copying itself and its descendants to a new stack. This child-copying may require updating local pointers, but never remote pointers. Since child-copying is used whenever a lazy thread suspends or yields, we also have the invariant that whenever a child suspends it is at the top of the stack.

Child-copying creates a new stack for the children copied and a stub to link the new child stack to the parent stack. The resulting configuration is exactly the same as if the bottom child of those copied had been eagerly forked.

When a child suspends and its parent is not on the parent queue, the suspend transfers control to an ancestor of the child, forcing many frames to be copied at once. The pointers to each of these frames will have to be updated. The first figure below shows the effects of two suspend operations. In the first, the parent is on the parent queue and one frame is copied. In the second, the parent is not on the parent queue, forcing multiple copies.

The implementation of eager-disconnect differs depending upon the control model. When seed activation is used to steal work, a thread having its seed stolen is copied to a new stack to ensure that it is the bottom frame. The second figure below shows the effect of activating a seed from a lazy thread.

(Footnote: An automatic variable is one local in scope to a function, not to be confused with memory local to a processor.)


Figure: Disconnection using child-copy for a suspending child. First D suspends and is copied to a new stack; then C suspends when B has no work. Because B is not on the parent queue, control is passed to A, forcing both B and C to be copied to a new stack, and C's parent pointer is updated.

Figure: Disconnection caused by seed activation also uses child-copy. Thread B has a seed stolen, causing it to be copied to a new stack to maintain the invariant that a thread with a remote child is at the bottom of a stack.


Figure: Disconnection caused by continuation stealing. B is stolen by another processor and migrates, leaving behind a forwarding frame.

Since continuation stealing results in migration, we need to ensure that a forwarding frame is left behind after the frame is copied (see the figure above). When a child of the stolen frame returns, the forwarding frame sends the results to the parent's new location. When the stolen thread completes, it sends its results back to the forwarding frame, which returns to the stolen frame's parent and then exits.

Multiple Stacks Timings

As the timings in the table below show, the multiple-stacks mapping introduces no overhead for sequential calls, lforks, or forks. However, when independent threads are needed, a significant overhead is incurred. For all schemes, the cost of creating, suspending, and freeing a thread includes the cost of the stub used to link the thread to its parent. Notice that even though stubs are used, the cost of using threads in the multiple-stacks model is about the same as that in the linked-frame model (compare the linked-frames table above).

Overflow protection introduces the largest overhead in the multiple-stacks model. The dynamic scheme is clearly unacceptable because the call to protect the page on all scheduling operations costs four orders of magnitude more than any other operation. The static guard page avoids this problem at the cost of wasting memory and increasing the cost of allocating a stack from the system. Timing malloc and mprotect is imprecise at best.


Table: Times for the primitive operations using multiple stacks, broken into memory, branch, register-only, and system instruction counts for each of the following operations: C call, C return, sequential call, sequential return, create thread from pool, suspend thread (fixed), suspend thread (dynamic), schedule thread (fixed), schedule thread (dynamic), free thread (fixed), free thread (dynamic), create lazy thread, lazy return, eager-disconnect of an n-word frame, and create lazy thread after lazy-disconnect. These times do not include any control costs. They assume stacks are allocated from a pool of stacks and exclude the cost of a malloc and mmap. "Fixed" denotes the fixed and static guard schemes; "dynamic" denotes the dynamic guard page scheme. System calls are included as cycle counts. The eager-disconnect cost is the minimum cost.

When a static guard page is used, however, it increases the cost of allocating a new stack substantially.

The other significant increase in cost imposed by the multiple-stacks model is due to eager-disconnect. Eager-disconnect causes at least n memory references, where n is the size of the frames copied. In addition, all pointers to the frames have to be updated, including the pointers in the ready queue and the links from parents to their children.

Although sequential and potentially parallel calls are efficiently implemented in the multiple-stacks model, the inefficiencies inherent in supporting overflow protection and eager-disconnect make it a less than ideal candidate for a storage model. It introduces both internal and external fragmentation, and it is non-portable.


Figure: Two examples of how the cactus stack could be represented using spaghetti stacks. In the first, the cactus stack is local to a single processor. In the second, it is split across processors.

Spaghetti Stacks

Spaghetti stacks are a compromise between the fork-centric linked-frame approach and the call-centric stack approach. By removing the requirement that all free space in a stack be above the current frame, the cactus stack can be implemented in a single stack. This means that no storage-related disconnection operations are needed when a lazy thread becomes independent.

The use of a single interwoven stack for multiple environments was first introduced by Bobrow and Wegbreit. It has since been used in a more simplified form in Olden and in implementing functional languages. In all cases it has been accompanied by a garbage collector to perform compaction.

In a spaghetti stack, every allocation request is fulfilled by allocating space at the top of the stack. A frame in the middle of the stack can exit, however, and create free space in the middle of the stack. The spaghetti-stack model thus requires a method to reclaim this fragmented freed space. It also requires that the global cactus stack be maintained explicitly, with links between every child and its parent (see the figure above).

Unlike the linked-frame and multiple-stacks approaches, in which each frame or stack is associated with a thread, spaghetti stacks are assigned to processors in the system, so that each processor has its own spaghetti stack. The global cactus stack is maintained with global pointers between child frames on one processor and their parent frames on remote processors.


Figure: Allocation of a new frame on a spaghetti stack.

The main advantage of spaghetti stacks over multiple stacks is that no storage-related disconnection operation is needed. In addition, allocation requests are relatively inexpensive compared to the linked-frame approach. The disadvantages are that it can be expensive to reclaim freed space and that internal fragmentation can consume a lot of memory.

Spaghetti Stack Layout

In addition to the stack pointer and frame pointer used in a single-threaded system, a spaghetti stack requires a top pointer (top), which points to the first free word above the last word used in the stack, i.e., the location to which sp would point in a traditional stack.

The cactus stack is maintained by links in each frame on the stack. A link from parent to child, even for sequentially called children, is required because the child frame may not be adjacent to its parent (see the figure above).

Spaghetti Stack Operations

Allocation

The allocation operation is identical for call, lfork, and fork. The child allocates space for its frame by adding to top (see the figure above). It saves the current sp in the base of its frame and then sets the sp to the top of its frame and the fp to the base of its frame.

(Footnote: Remote fork allocates the remote stub in the same manner and then performs a local fork to allocate the thread.)
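The allocation step just described can be sketched as a pointer-bumping routine. This is a simplified model that assumes an upward-growing stack and uses illustrative field names, not the compiler's actual code:

```c
#include <stddef.h>

/* Sketch of a per-processor spaghetti stack. */
typedef struct {
    char *top;  /* first free word above the last word used */
    char *sp;   /* current stack pointer                    */
    char *fp;   /* current frame pointer                    */
} spaghetti_t;

/* Allocate a child frame of `bytes` bytes at the top of the stack.
   The caller's sp is saved in the base of the new frame, forming the
   explicit parent link the spaghetti stack requires. */
static char *spaghetti_alloc(spaghetti_t *s, size_t bytes) {
    char *frame = s->top;
    s->top += bytes;
    *(char **)frame = s->sp;  /* save caller's sp in the frame base */
    s->sp = s->top;           /* sp now points above the new frame  */
    s->fp = frame;            /* fp points at the base of the frame */
    return frame;
}
```

Because each frame saves the caller's sp in its base, the parent-to-child link survives even when the two frames are later separated by free space.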


Figure: Deallocating a child that has run to completion.

Figure: Deallocating a child that has suspended but is currently at the top of the spaghetti stack.

Deallocation

When a child returns or exits, it attempts to deallocate its frame. There are three possibilities: the child has run to completion; the child has suspended and is at the top of the stack; or the child has suspended and is not at the top of the stack.

If the child has run to completion, then it must be at the top of the spaghetti stack. In this case it deallocates its frame by setting top to the base of its frame (see the first figure above).

If the child previously suspended but is currently at the top of the stack, it checks the area below it to see if it is active or free. If it is active, then the child sets top to the base of its frame. If the area below has been marked as free, then it sets top to the base of the free area (see the second figure above).

If the child is not at the top of the stack, then it can mark its frame as free, but it cannot actually reclaim the space, since there are active frames above it in the stack. In

(Footnote: This is only true if work distribution is performed by work stealing. If remote fork is used to create threads on other processors of a distributed-memory machine, then we must consider additional cases.)


Figure: Deallocating a child that has suspended and is in the middle of the stack. The child coalesces any free space below it into its frame.

this case the child marks its frame as free and checks to see if the frame below it is free. If the frame below is free, then both frames are merged into one free area. Later, when the frame above it exits, it will deallocate itself and the current frame, including any of the frames with which it was merged, as described in the previous paragraph (see the figure above).

Whenever a frame that is not at the top of the stack attempts to deallocate itself, it must have a suspended frame above it. This invariant lets us use PCRCs, as described earlier, to eliminate any reclamation overhead for frames that run to completion. In other words, if a frame runs to completion, it need not perform any checks but can blindly set top to the base of its frame.
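The middle-of-stack case can be sketched with a small frame header. This is an illustrative model with assumed field names, not the actual runtime structure:

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed per-frame header for the sketch. */
typedef struct frame {
    struct frame *below;  /* frame or free area directly below, or NULL */
    bool          free;   /* marked free but not yet reclaimed          */
} frame_t;

/* A frame in the middle of the stack exits: mark it free and, if the
   area below is already free, merge the two into one free area.
   Returns the base of the resulting free area. */
static frame_t *mark_and_coalesce(frame_t *f) {
    f->free = true;
    if (f->below != NULL && f->below->free)
        return f->below;  /* merged with the free area below */
    return f;
}
```

A frame that ran to completion skips all of this: by the invariant above, it is at the top of the stack and simply resets top.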

Spaghetti Stack Timings

As the table below shows, spaghetti stacks are a good compromise between linked frames and multiple stacks. In all cases they are less expensive than linked frames, and all thread operations are cheaper on spaghetti stacks than on multiple stacks.

A sequential call or lfork on a spaghetti stack incurs the cost of saving a link to its parent, but it avoids allocation from a pool, which is expensive. It is only slightly more expensive than a sequential call on a stack.

The thread operations shown here include the cost of using a stub for independent threads and for lazy threads invoked after either eager- or lazy-disconnect. Stubs allow us to reclaim the memory in the spaghetti stack without having to perform a test on every return.


Table: Times for the primitive operations using spaghetti stacks, broken into memory, branch, and register-only instruction counts for each of the following operations: C call, C return, sequential call, sequential return, create thread, schedule thread, free thread, create lazy thread, lazy return, eager-disconnect, and create lazy thread after disconnect. These times do not include any control costs.


Figure: A cactus stack using stacklets.

Stacklets

Stacklets, introduced in our earlier work, effect a different compromise between the linked-frame and stack approaches. Whereas spaghetti stacks incur extra cost when a function returns, to handle reclamation of free space, stacklets maintain a stack invariant but incur extra cost on function call, to check for overflow.

A stacklet is a region of contiguous memory on a single processor that can store several activation frames. Unlike spaghetti stacks, each stacklet is managed like a sequential stack. As in the stack model, the global cactus stack is maintained implicitly between frames in a single stacklet and explicitly with links between stacklets (see the figure above).

Stacklet Layout

Each stacklet is divided into two regions: the stub and the frame area. The stub contains data that maintains the global cactus stack by linking the individual stacklets to each other. The frame area contains the activation frames (see the figure below).

Stacklet Operations

Stacklets exploit the fact that each independent thread begins at the base of a new stacklet, while lazy threads are allocated on the same stacklet as their parent. Thus lazy thread forks and sequential calls both perform a sequential allocation. A sequential allocation is one that requests space on the same stack as the caller. The child performs the allocation and determines whether the new frame fits in the current stacklet. If it does, the sp is updated appropriately. If not, a new stacklet is allocated and the child frame is allocated on the new stacklet. (See the figures below.)


Figure: The basic form of a stacklet. The frame area holds the running frame and free space; the stub holds the address used to enter the stub routine, storage for the sp, storage for the parent pointer, and storage for the parent's return address.

Figure: The result of a sequential call which does not overflow the stacklet.

Figure: The result of a fork or a sequential call which overflows the stacklet.

An eager fork also causes the child to be allocated on a new stacklet. We can either run the child in the new stacklet immediately or schedule the child for later execution. In the former case, sp points to the child stacklet. In the latter case, the sp remains unchanged after the allocation.


Figure: A remote fork leaves the current stacklet unchanged and allocates a new stacklet on another processor.

For a remote fork there are no stacklet operations on the local processor. Instead, a message is sent to a remote processor with the child routine's address and arguments (see Figure).

The overhead of checking for stacklet overflow in a sequential call is three register-based instructions and a branch, which will usually be successfully predicted. If the stacklet overflows, a new stacklet is allocated. This cost is amortized over the many invocations that will run in the stacklet.
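The allocation path described above can be sketched in C. This is a minimal illustration, not the dissertation's code: the stacklet size, the field names, and the use of malloc in place of a stacklet pool are all assumptions made for clarity.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stacklet layout: a stub followed by a frame area. */
#define STACKLET_SIZE 4096

typedef struct stacklet {
    struct stacklet *parent;    /* stub: link in the global cactus stack */
    char *sp;                   /* stub: next free byte in the frame area */
    char frames[STACKLET_SIZE]; /* frame area */
} stacklet_t;

stacklet_t *new_stacklet(stacklet_t *parent) {
    stacklet_t *s = malloc(sizeof *s);  /* a real runtime would use a pool */
    s->parent = parent;
    s->sp = s->frames;
    return s;
}

/* Sequential allocation: the child asks for space on the caller's
   stacklet; only on overflow is a new stacklet allocated.  The common
   case is one comparison and a usually well-predicted branch. */
char *alloc_frame(stacklet_t **cur, size_t size) {
    stacklet_t *s = *cur;
    if (s->sp + size <= s->frames + STACKLET_SIZE) {  /* fits: common case */
        char *frame = s->sp;
        s->sp += size;
        return frame;
    }
    *cur = new_stacklet(s);       /* overflow: rare, cost amortized */
    return alloc_frame(cur, size);
}
```

A run of small sequential calls stays on one stacklet; only the call that overflows pays for a new one.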

Stacklet Stubs

As with multiple stacks, stacklet stubs are used to link the individual stacklets that form the global cactus stack. The stub handler performs stacklet deallocation and, using the data in the stacklet stub, carries out the necessary actions to return control to the parent.
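The stub handler's job can be sketched as follows. The field names and the function-pointer stand-in for a saved return address are assumptions for illustration; the dissertation's stub also saves the parent's sp.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stub contents; not the dissertation's actual layout. */
typedef struct stacklet {
    struct stacklet *parent_stacklet;  /* stacklet holding the parent frame */
    void (*parent_retadr)(void);       /* parent's saved return address */
} stacklet_t;

static stacklet_t *current;   /* this processor's current stacklet */
static int resumed_parent;

static void parent_code(void) { resumed_parent = 1; }

/* The stub handler runs when the bottom frame of a stacklet returns:
   it deallocates the stacklet and, using the data saved in the stub,
   returns control to the parent. */
void stub_handler(void) {
    stacklet_t *dead = current;
    void (*ret)(void) = dead->parent_retadr;
    current = dead->parent_stacklet;
    free(dead);
    ret();   /* transfer control back to the parent */
}
```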

Compilation

To reduce the cost of frame allocation even further, we construct a call graph which enables us to determine, for all non-recursive function calls, whether an overflow check is needed. Each function has two entry points: one that checks for stacklet overflow and another that does not. If the compiler can determine that no check is needed, it uses the latter entry point. This analysis may insert code to perform a stacklet allocation to guarantee that future children will not need to perform any overflow checks.


Table: Instruction counts (memory references, branches, and register instructions) for the primitive operations using stacklets: C call; C return; sequential call; sequential call with overflow; sequential return; create thread from pool; schedule thread; free thread; create lazy thread; create lazy thread, no check; lazy return; eager-disconnect of an n-word frame; and create lazy thread after lazy-disconnect. These times do not include any control costs. They assume stacklets are allocated from a pool of stacks and exclude the cost of a malloc.

Timings

Stacklets, like spaghetti stacks, offer a compromise between linked frames and multiple stacks. As seen in the timings in Table, sequential call and lfork behave similarly and require no memory references. However, when a stacklet overflows or a new thread is created, we must create a new stacklet from the pool, which costs slightly more than creating a new linked frame. Like multiple stacks, eager-disconnect is expensive.

We reject the use of guard pages to detect stacklet overflow because of the cost, in cycles, of taking an overflow fault. The use of guard pages reduces the cost of sequential call and lfork by one branch and four register instructions. Thus, for the guard-page scheme to pay off, an overflow would have to be correspondingly rare.

Discussion

The main disadvantage of stacklets is that if a frame is at the top of a stacklet, each of its children must be allocated on a new stacklet. This boundary case is similar to the overflow case when using register windows. We can reduce the probability that this occurs


Table: Instruction counts (memory references, branches, and register instructions) for the primitive operations in the mixed model: C call; C return; sequential call; sequential return; create thread (stacklet); schedule thread; free thread (stacklet); create lazy thread (stacklet); create lazy thread (stacklet, no check); lazy return (stacklet); create lazy thread after lazy-disconnect; create thread or lazy thread (heap); free thread (heap); and lazy return (heap). Times are for the primitive operations using a sequential stack for purely sequential calls, stacklets for potentially parallel and suspendable sequential calls with moderately sized frames, and the heap for the rest. These times do not include any control costs. They assume stacklets and heap-allocated frames are allocated from a pool and exclude the cost of a malloc.

by increasing the size of each stacklet. Additionally, the compiler optimization mentioned in Section eliminates many of the places where this boundary case can occur.

Mixed Storage Model

In all of the memory models except the multiple stacks model, sequential call and lfork cost more than a sequential call on a stack. For this reason we propose a fifth alternative, the mixed storage model, which combines a stack with stacklets and linked frames in a single system. In such a system, activation frames can be stored on the stack, a stacklet, or the heap, depending upon the type of call and the requirements of the child. The compiler chooses the least expensive storage mechanism for each activation frame. The least versatile and least expensive method is to use a stack, which is available only for purely sequential calls, i.e., for calls that can never suspend. The most expensive method is to store each frame in a heap-allocated memory segment. A heap-allocated frame store requires a link from child to parent, which is implicit when using a stack.


Between these two extremes, we use stacklets for frames that are allocated by forks. A stacklet combines the efficiency of a stack with the flexibility of heap-allocated frames for those frames that are, or may become, independent threads of control. This method allows us to assign to each thread its own logically unbounded stack without having to manage multiple stacks or require garbage collection. Our compilation strategy allows us to use a sequential stack for sequential activation frames, stacklets for small to moderately sized thread frames, and the heap for large thread frames.
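The compiler's choice among the three mechanisms can be summarized by a small classification routine. This is a sketch only: the size cutoff and the names are assumptions, and the real decision is made statically at compile time rather than by a runtime function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compile-time classification for the mixed storage model. */
typedef enum { ON_STACK, ON_STACKLET, ON_HEAP } storage_t;

#define LARGE_FRAME 2048   /* assumed cutoff for "large" thread frames */

/* can_suspend: may this call ever suspend (i.e., is it potentially
   parallel)?  frame_size: bytes of activation-frame storage needed. */
storage_t choose_storage(int can_suspend, size_t frame_size) {
    if (!can_suspend)
        return ON_STACK;        /* purely sequential: cheapest */
    if (frame_size <= LARGE_FRAME)
        return ON_STACKLET;     /* small/moderate thread frame */
    return ON_HEAP;             /* large thread frame: most flexible */
}
```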

The Memory System

Here we specify the possible memory hierarchies of the concrete implementation of the abstract machine. While we can implement LMAM on either a shared memory machine (SMM) or a distributed memory machine (DMM), either an MPP or a NOW, the underlying memory system will significantly affect the implementation and cost of the different mechanisms described here. There are three main areas of impact: locks, thread migration, and split-phase operations.

Many of the operations described have to be performed atomically. For instance, lreturn must remove the parent from the parent queue and return control to the parent atomically. If it does not, a steal operation may steal the parent thread from the parent queue at the same time. On a DMM using Active Messages, the steal operation is only initiated when the network is polled for a message. Since we control when the network is polled, we avoid the explicit manipulation of locks. On an SMM, one of three approaches must be used: the parent queue must be locked on every lfork and lreturn, multiple queues must be maintained, or active messages for work-stealing requests must be simulated. With all but the last technique, SMM implementations will introduce significant overhead.

On an SMM it is easy to migrate a thread, while on a DMM the migration of threads requires sending all of the data in the thread's frame across the network. Since local pointers are not valid on another processor, we require either that all pointers be global or that no pointers be placed in the frame. When seed activation is used to represent nascent threads, there is an alternative, for it is easy to migrate a nascent thread. A nascent thread has no data associated with it except its arguments. Thus, when a nascent thread is activated, it can be activated either locally or remotely.

(Footnote: The only synchronization point is the buffer used to store the steal requests. However, when a steal request is inserted into the buffer, the processor inserting the message is by definition idle.)

Finally, on a DMM, split-phase memory operations are often used to hide the network latency. A split-phase read operation consists of two messages: a request and a response. The request must specify the address of the remote location and the address of the local destination. The response then stores the value into the local destination. If we allow child- or parent-copying disconnection, then the local destination may change between the request and the response if the destination is in a frame that is copied due to disconnection. Thus, on a DMM, neither child- nor parent-copying can be used unless we forbid thread operations during split-phase operations.
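The two split-phase messages can be sketched as plain structs. The message layouts and handler names are assumptions standing in for an Active-Message-style transport; the point is that the request carries the local destination address, which must remain valid until the response arrives.

```c
#include <assert.h>
#include <stdint.h>

/* Request: names a remote source and a local destination. */
typedef struct {
    uint64_t *remote_addr;  /* address on the remote processor */
    uint64_t *local_dest;   /* where the response should store the value */
} read_request_t;

/* Response: carries the value back along with the destination. */
typedef struct {
    uint64_t *local_dest;
    uint64_t  value;
} read_response_t;

/* Remote side: turn a request into a response. */
read_response_t handle_request(read_request_t req) {
    read_response_t resp = { req.local_dest, *req.remote_addr };
    return resp;
}

/* Local side: the response handler completes the read.  If the frame
   holding local_dest were copied by disconnection in the meantime, this
   store would hit a stale location, which is why child- and
   parent-copying are forbidden during split-phase operations. */
void handle_response(read_response_t resp) {
    *resp.local_dest = resp.value;
}
```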

In the remaining chapters we will focus on distributed memory multiprocessors. For this reason we will look only at systems that do not allow thread migration and that perform lazy-disconnect. We will also generally skirt the issue of atomicity, since we can easily guarantee it by polling the network, controlling when handlers are permitted to execute.

Summary

In this chapter we have investigated five storage models for implementing thread state. The multiple stacks model is the most expensive for thread operations and the least expensive for lazy thread and sequential operations. However, since we do not use garbage collection, it does not provide an infinite logical stack per thread. For this reason we will not study it further.

Of the remaining four models, linked frames are the most expensive for sequential operations and moderately effective for thread operations. The mixed model, stacklets, and spaghetti stacks strike a compromise between sequential and thread operations.

In all cases, the timings we have presented depend heavily on integrating thread operations directly into the compiler. By integrating knowledge of the storage model into the compiler, we are able to construct a function prolog and epilog that is tailored to the specific model being used. This is carried even further for the mixed model, where the size of the function's activation frame determines whether it is heap-allocated, stack-allocated, or stacklet-allocated. Furthermore, we discussed how the underlying memory model of the machine affects how atomicity is achieved.

(Footnote: If an argument is a pointer, it has to be a global pointer.)


In all cases we have ignored the effects that the different thread representations and queuing methods impose on the storage model. We consider these effects in the next chapter.

Chapter

Implementing Control

In this chapter we present the control portion of a concrete implementation of the abstract machinery introduced in Chapter for the support of a lazy thread fork. We begin with the thread seed model and present the different means of invoking threads and representing nascent threads. We then explore the different ways in which nascent threads can be queued and dequeued. This leads us into a discussion of the implementation of disconnection. Before proceeding to discuss the continuation-stealing model, we show how to reduce the cost of synchronization and how to use lazy parent queues.

Representing Threads

As we saw in Chapter, the inclusion of a lazy thread fork requires that threads, or potential threads, be represented in many forms. What would have been a thread in MAM can begin life as a nascent thread, become a lazy thread, and even be transformed by disconnection into an independent thread. In this section we describe the concrete representations of nascent threads as thread seeds and closures, and the implementation of an lfork as a parallel ready sequential call.

Sequential Call

Although a sequential call does not start a new thread, it is the basis upon which we build all the representations for lazy threads. In this chapter we are concerned mainly with the transfer of control between parent and child. For a sequential call, the child begins executing by saving its return address. Instead of storing the return address in the child frame, the child stores the return address at a fixed offset from the top of the parent stack frame. The key feature we exploit is that the return address is in the same relative location in every parent frame.

The child returns to the parent by doing an indirect jump through the location that contains the return address. We exploit the flexibility of the indirect jump to implement lfork as a sequential call.
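The convention can be sketched in C with a function pointer standing in for the saved return address; the struct layout and names are assumptions for illustration, since the real mechanism operates on machine addresses.

```c
#include <assert.h>

/* The child stores its return address at a fixed slot in the PARENT's
   frame and returns by an indirect jump through that slot. */
typedef struct frame frame_t;
struct frame {
    void (*retadr)(frame_t *parent);  /* same relative slot in every frame */
    int x;                            /* some parent state */
};

/* Code executed when the child returns (the "return address"). */
static void after_call(frame_t *f) { f->x = 99; }

void child(frame_t *parent, void (*ret)(frame_t *)) {
    parent->retadr = ret;   /* child saves its return address in the parent */
    /* ... child body runs here ... */
    parent->retadr(parent); /* indirect jump back to the parent */
}
```

Because the return is an indirect jump through a slot the parent owns, the parent can later redirect it, which is exactly what lfork exploits.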

The Parallel Ready Sequential Call

The parallel ready sequential call implements the abstract lfork operation. It starts its child thread, a lazy thread, with a sequential call. The child stores its return address, just as a sequentially called child does, in the parent's frame. The child returns to the parent using the sequential return instruction.

To support disconnection, every return address has associated with it the addresses of two seed routines. These routines, findadr_susp and findadr_steal (see Section), are used to handle child suspension and the receipt of a work-stealing request, respectively. There are many alternative implementation strategies for associating the seed routines with the return address: inserting a jump table or addresses between the call and the normal return, defining fixed offsets, or a central hash table, to name a few. In all cases, given a return address, the seed routines should be easily computed, and if they are not needed they should not interfere with sequential execution.

For flexibility and ease of implementation, we insert jump instructions to the seed routines between the call instruction and the code that is executed when the child returns. Figure shows a sample instruction sequence for a parallel ready sequential call. Every thread codeblock begins with an adjustment to the return address, just as in the Sparc. This adjusted return address points to the code immediately following the seed routines associated with the lfork that invoked it (see the prolog code in Figure).

The decision between implementations of the find functions should be based on the cost of adjusting the return address, the cost of finding the seed routines, whether branch prediction is used on return, and the behavior of the cache. An ideal implementation allows a seed routine to be found and executed quickly without interfering with sequential execution.

(Footnote: Leaf functions are not required to store their return address, but for the purposes of this discussion we assume they do.)


A sample lfork instruction sequence:

        ...
        call child              // lfork child
        jmp steal-fragment
        jmp suspend-fragment
    ret:                        // return here
        ...

A sample function prolog/epilog:

        retadr = retadr + 2           // function prolog
        save retadr in a frame slot
        ...                           // codeblock for child
        load retadr from frame slot   // function epilog
        jmp retadr

Figure: Pseudocode for a parallel ready sequential call and its associated return entry points.

As the sequential case occurs more frequently than finding a seed, an implementation should favor non-interference over finding the seed.

Inserting a jump table between the call and return requires the return address to be adjusted on every call and allows the seed routines to be found by doing address arithmetic. This is a particularly good method on architectures like the Sparc, which allow the return address to be adjusted as part of the return instruction, making the adjustment free. However, if the processor predicts return branches based on the address of the call instruction, then adjusting the return pointer causes a misprediction on return, increasing the cost of a sequential return. The inserted table also makes the sequential code larger, possibly leading to an increased instruction cache miss rate.

An alternate method to implement find is for the compiler/linker to construct a hash table that maps return addresses to the appropriate seed routines. This method produces no interference with sequential calls: the return address does not need to be manipulated and the code size is not increased. It does, however, increase the cost of finding seed routines.

A more complicated approach is to place the seed routines at a large fixed offset from the return address. This method does not interfere with the sequential call, and the seed routines can be found simply by using address arithmetic. However, the compiler/linker must interleave unrelated code and seed routines in order not to waste code space.

Seeds

A thread seed represents a nascent thread. It is a pair of pointers: a frame pointer and a code pointer. The addresses of the suspend and steal routines can be derived from the code pointer.

A seed is activated by loading the frame pointer and executing one of the routines associated with the code pointer. The suspend routine creates a new lazy thread. The steal routine starts a new independent thread and then exits. Both routines update the parent thread state to reflect the seed's activation.
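The seed pair and its two activation paths can be sketched as follows. In the real implementation the suspend and steal routines are derived from the code pointer by address arithmetic; modeling that derivation with a struct of function pointers is an assumption made for clarity, as are all names here.

```c
#include <assert.h>

/* Stand-in for "the routines derivable from the code pointer". */
typedef struct {
    void (*suspend_routine)(void *frame); /* creates a new lazy thread */
    void (*steal_routine)(void *frame);   /* starts an independent thread */
} seed_code_t;

/* A thread seed: a frame pointer and a code pointer. */
typedef struct {
    void        *frame;  /* parent frame holding the seed's arguments */
    seed_code_t *code;
} thread_seed_t;

static int last_activation;  /* demonstration only */
static void demo_suspend(void *frame) { (void)frame; last_activation = 1; }
static void demo_steal(void *frame)   { (void)frame; last_activation = 2; }
static seed_code_t demo_code = { demo_suspend, demo_steal };

/* Activation: load the frame pointer, run one of the two routines.
   Both would also update the parent thread's state. */
void activate_on_suspend(thread_seed_t s) { s.code->suspend_routine(s.frame); }
void activate_on_steal(thread_seed_t s)   { s.code->steal_routine(s.frame); }
```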

The arguments used to start the nascent thread, if any, are stored in the parent thread frame, i.e., the frame pointed to by the frame pointer in the seed. These arguments cannot be modified between enqueuing and activating the seed. If the program can modify the arguments, a closure is used instead of a seed to represent the thread.

Closures

We use a closure when the arguments to the fork of a nascent thread may change in the parent between the time the fork is represented as a nascent thread and the time that the nascent thread is instantiated (see Figure). It consists of two pointers: a code pointer and an environment pointer. The code pointer, like the code pointer in a seed, is used to derive the suspend and steal routines, which activate the closure.

The environment is a heap-allocated region of memory that contains the arguments and a pointer to the frame that created the closure. The mutable arguments are placed in the environment. Any arguments guaranteed not to change are left in the frame.

When a closure is activated, the activation routine uses the frame pointer in the environment to change the state of the parent thread. The environment, and possibly the frame, are used as the source of the arguments to the new thread. The closure is freed when the new thread is created.
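The snapshot behavior that motivates closures can be sketched directly: the mutable argument is copied into a heap-allocated environment, so later writes in the parent no longer affect the eventual fork. All names and the single-argument environment are illustrative assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Environment: the mutable argument plus a link to the creating frame. */
typedef struct {
    void *parent_frame;  /* frame that created the closure */
    int   a;             /* snapshot of the mutable argument */
} env_t;

/* Closure: a code pointer (modeled as a function pointer) and an
   environment pointer. */
typedef struct {
    void (*activate)(env_t *env);
    env_t *env;
} closure_t;

static int forked_with;   /* demonstration only */
static void demo_activate(env_t *e) { forked_with = e->a; }

closure_t make_closure(void *parent_frame, int a,
                       void (*activate)(env_t *)) {
    env_t *env = malloc(sizeof *env);
    env->parent_frame = parent_frame;
    env->a = a;          /* later writes to 'a' in the parent are invisible */
    closure_t c = { activate, env };
    return c;
}

void activate_closure(closure_t c) {
    c.activate(c.env);   /* start the thread from the snapshot */
    free(c.env);         /* closure is freed once the thread is created */
}
```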

Summary

In the seed model we have two ways to represent nascent threads, seeds and closures, and one way to execute a lazy thread, the parallel ready sequential call. It is no coincidence that the return address of a parallel ready sequential call is a code pointer that could be used for a seed. In fact, we have also arranged for the code pointer to be stored in the parent's frame. The result is that the child has a pointer to a location which stores the code pointer. In other words, the child has a thread seed to the remaining work in the parent.

In the next section we discuss the ways in which work (threads, thread seeds, and closures) is queued, found, and then executed in the system. Since closures are queued and dequeued in the same manner as thread seeds, we discuss only thread seeds. We show how thread seeds and parallel ready sequential calls interact and affect the parent queue and its representations.

The Ready and Parent Queues

The abstract machine has two queues of threads that are ready to run: the ready queue and the parent queue. The former contains threads in the ready state, which are scheduled by the general scheduler. The latter contains threads that have passed control to a child, i.e., threads in the has-child state.

The Ready Queue

In order to bound the amount of memory that the ready queue consumes, we distribute it among the activation frames. Any data structure that allows us to do this would be sufficient, and we choose to represent the ready queue as a linked list exhibiting LIFO behavior. Thus the ready queue is maintained for each processor by a global register which points to the first ready frame. Each frame in turn points to the next frame in the ready queue.
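The distributed ready queue amounts to a per-processor intrusive LIFO list. This is a minimal sketch under assumed names; in the real system the head lives in a reserved machine register and the link field is a slot in the activation frame.

```c
#include <assert.h>
#include <stddef.h>

/* Each frame carries the link used while it sits on the ready queue,
   so the queue consumes no storage beyond the frames themselves. */
typedef struct frame frame_t;
struct frame {
    frame_t *next_ready;  /* valid only while on the ready queue */
    int id;
};

static frame_t *ready_head;  /* the per-processor "global register" */

void push_ready(frame_t *f) {   /* LIFO enqueue */
    f->next_ready = ready_head;
    ready_head = f;
}

frame_t *pop_ready(void) {      /* schedule the most recently readied */
    frame_t *f = ready_head;
    if (f) ready_head = f->next_ready;
    return f;
}
```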


Our implementation does not depend on the data structure chosen to represent the ready queue. This is important because the data structure can have a significant effect on execution time and storage requirements. The effect of the data structure for the ready queue is orthogonal to the goals of this thesis.

The Parent Queue

Unlike the ready queue, which is involved with independent threads only (i.e., threads that have already suspended and need to be scheduled), the parent queue holds nascent threads. More precisely, lfork requires that nascent threads in the parent be represented by enqueuing the parent thread on the parent queue. This means that we have to pay more attention to the cost of enqueuing and dequeuing, or we lose the benefit of lazy threads.

A thread seed can be enqueued on either an implicit queue (see Section) or an explicit queue (see Section). When a thread seed is implicitly enqueued we call it an implicit seed, and when it is explicitly enqueued we call it an explicit seed.

Whether implicit or explicit, the parent queue is implemented here as a stack. Thus, when an item is enqueued on the parent queue, it is placed at the top of the parent queue. Furthermore, dequeue removes the most recently enqueued item.

The Implicit Queue and Implicit Seeds

The parallel ready sequential call uses the call mechanism itself to implicitly enqueue the parent's seed as it transfers control to the child. The seed is enqueued in the parent frame by the act of calling the child. When an implicit queue is used, we establish the invariant that a child always returns control to its parent, even if it suspends. This allows us to use the return address produced by the parallel ready sequential call as the parent's thread seed, embedding the parent queue in the cactus stack (see Figure).

The implicit queue is thus woven throughout the activation frames and managed implicitly through the call and return operations, eliminating the cost of managing the queue whenever the child runs to completion. We call a return address enqueued by a parallel ready sequential call an implicit seed. When a fork that is translated into a parallel ready sequential call is followed by another fork, the implicit seed created by the first of


Figure: An example of a cactus stack split across two processors (Processor A and Processor B) with the implicit parent queue embedded in it. The threads on the parent queue are guaranteed to be ancestors of the currently running thread. (The figure's legend distinguishes the running thread on a processor; a thread in the ready or idle state; a thread on the parent queue with a thread seed; and a thread in the has-child state with no work.)

For the code

    fork X();
    fork Y();
    join;

the first fork creates an implicit thread seed, while the last fork creates an implicit seed that is used only to transfer control to the parent (a "no work" seed).

Figure: Example of when a fork generates a thread seed and when it does not.



these two forks is a thread seed (see Figure). If the code following the parallel ready sequential call does not contain a fork, then the implicit seed is used only to transfer control from the frame in which it resides to the frame's parent.

The return address of the child thread stored in the parent's frame is an implicit seed that can be activated by either a suspending child or a work-stealing request. On suspension, a child returns control to its parent not at the return address but at the seed routine for suspension, i.e., findip_suspend. This removes the parent from the implicit parent queue and starts the suspend routine, which activates the seed and starts the next thread. If a work-stealing request occurs, it finds the implicit seed by searching the stack for a thread seed. Only the frames from the running thread to the root of the cactus stack need to be searched. All the other threads must be either idle or ready, which means they cannot have any nascent threads. When it finds an implicit seed that is a thread seed, it invokes the steal routine found in the seed's frame. Since an implicit seed is stored in the frame that will invoke the nascent child thread, the seed itself is represented by a single word.

(Footnote: The parallel call does not have to immediately follow the parallel ready sequential call. It only must precede a join.)

There are two drawbacks to an implicit queue, both of which derive from the fact that it contains only seeds created by parallel ready sequential calls. First, an implicit seed may be created even when there is no nascent thread in the parent. For example, the last lfork before a join creates an implicit seed that indicates that the parent has no threads available. In this case, the suspend routine causes the parent to suspend, recursively invoking its parent's suspend routine. Likewise, the steal routine for such an implicit seed invokes its parent's steal routine in search of a nascent thread to activate. The entire path from the currently executing child to the root may have to be scanned to verify that there are no nascent threads to invoke. The other disadvantage is that implicit seeds can only be created for nascent threads that follow a parallel ready sequential call. This means that parallel loops must be turned into loops of function calls and that the sequential code following a lone fork must be encapsulated in a function call. Despite these two disadvantages, implicit seeds are useful because they cost nothing to enqueue, a side effect of invoking the lazy thread that precedes them.

The Explicit Queue and Explicit Seeds

An explicit queue is a data structure, separate from the cactus stack, that contains thread seeds. A thread seed on such a queue is called an explicit seed. An explicit queue solves the two problems of implicit queues by enqueuing the parent thread on an explicit parent queue. When an explicit parent queue exists, a suspending thread or a work-stealing request may immediately find work by removing a seed from the parent queue.

Of course, explicit seeds may be enqueued on the parent queue at any time, not just when a parallel ready sequential call is made. This makes them useful not only for avoiding the stack walking needed by implicit seeds, but also for handling parallel loops and the case of a lone fork followed by sequential code. In order to handle parallel loops, we enqueue a seed before the loop starts which points to a routine that will fork off an iteration of the loop. After the seed routine creates the new thread, it replaces the seed in the queue. When the loop is finished, it removes the seed. The lone fork is represented as a


The code associated with an explicit seed in a thread frame:

        ...
        goto steal-fragment
        goto suspend-fragment
    explSeed:               // normal return continues here
        ...
    steal-fragment:
        sp = sp - offset    // steal routine continues here
        ...

Figure: Example of how an explicit seed is stored on the parent queue. The parent queue contains a pointer to a location (explSeed) in the thread's frame. The location contains a code pointer from which the seed routines can be derived. When invoked, the seed routines calculate the thread frame's stack pointer based on the offset, known at compile time, from the location to the frame's top.

nascent thread and enqueued by the parent before it starts executing the sequential code. If no other processor has activated the seed when the parent reaches the join, it activates the seed and then continues. In this way we never limit the parallelism in the original program, even though we are using seed activation as the underlying control model.

We implement the explicit queue as a fixed-size array that is managed by two global registers, the front and back pointers. We enqueue new seeds onto the front of the queue. If the seed is enqueued due to a parallel ready sequential call, then when the call returns it will pop the seed off the queue. If instead the child suspends, we also pop the seed off the front of the queue. On the other hand, a steal request takes a seed from the back of the queue. This allows work to be stolen from lower down in the cactus stack, and thus the work is more likely to be larger-grained.
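The front/back discipline can be sketched as a small double-ended array queue. The fixed size, index representation, and names are assumptions; the real queue is managed through two reserved registers and can be expanded if needed.

```c
#include <assert.h>
#include <stddef.h>

/* Explicit parent queue: lforks push and pop at the front; steal
   requests take from the back, so stolen work comes from lower in the
   cactus stack and tends to be larger-grained. */
#define QUEUE_SIZE 64

static void *seeds[QUEUE_SIZE];
static int front;   /* next free slot at the front */
static int back;    /* index of the oldest live seed */

void push_front(void *seed) { seeds[front++] = seed; }

void *pop_front(void) {         /* local pop: newest seed */
    return front > back ? seeds[--front] : NULL;
}

void *steal_back(void) {        /* remote steal: oldest seed */
    return front > back ? seeds[back++] : NULL;
}
```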

An explicit seed is a single pointer which points to a slot in the frame of the thread that created it. The slot in the frame contains the code pointer from which the seed routines can be derived. For example, if seeds are represented by a jump table (see Figure), then the code pointer for an explicit seed points to the instruction after the second jump (see Figure). The seed routines are invoked with the address of the seed itself, from which it is possible to calculate the base of the frame.

(Footnote: Though this appears to fix the size of the queue, it can be dynamically expanded if needed. In any case, empirical results show that a small queue, on the order of words, is sufficient for most programs.)


                      Implicit Queue          Explicit Queue

    Suspend:          goto (*retadr)-1        goto (*popSeed())-1

    Implicit Seed:    goto steal-routine
                      goto suspend-routine
                      // continue

    Explicit Seed:                            goto steal-routine
                                              goto suspend-routine
                                              popSeed()
                                              // continue

    No Work Seed:     goto (*retadr)-2
                      goto (*retadr)-1

Figure: A comparison of the different implementations of suspend depending upon the type of parent queue. Empty entries indicate that no such seed exists for the model. The units of address arithmetic are words.

Interaction Between the Queue and Suspension

The implementation of the suspend operation depends upon which representation of the parent queue we choose. If an implicit queue is used, then the child always returns to its parent, which then decides which nascent thread to activate. If only the explicit queue is used, then the child immediately transfers control to its youngest ancestor with work, i.e., to the frame pointed to by the thread seed at the front of the queue. Figure shows how suspend is implemented and what kinds of seeds are used depending upon the type of queue in the implementation.

When we have an implicit queue in the system, the child always returns to the parent, which means that sometimes the child returns to a thread that has no nascent threads. In this case, the seed left behind by the parallel ready sequential call that invoked the child is like the No Work Seed in Figure. As we can see from the figure, this seed recursively suspends the parent to its parent.

If we have only an explicit queue, then a child can never return to a parent without nascent threads. This is because such a parent would not have pushed a seed on the queue in the first place.

We cannot use both queues in the system at the same time. The reason is that a child would never know if its parent is enqueued implicitly, and thus would always have to return to its parent. Yet if the parent enqueued a seed explicitly, it would not know if it was reached via an explicit dequeue (so that its seed is no longer on the explicit queue) or from a child suspending with its seed still on the queue. While this could be fixed by setting and testing a flag, a more basic problem is that there is no explicit handle to the implicit queue. If for some reason a thread is scheduled which is not the child of the top parent on the implicit queue, then we lose the link to the implicit queue, possibly resulting in deadlock. In Section we show how a lazy parent queue can be built which combines the best features of both the explicit and implicit queues.

Disconnection and Parent-Controlled Return Continuations

Disconnection is a key component of a lazy multithreaded system. It allows a thread to start a child as a lazy thread and later, if necessary, to disconnect from it, turning the child into an independent thread. In this section we introduce the key mechanism that allows us to perform disconnection, parent-controlled return continuations, and discuss the differences between lazy-disconnect and eager-disconnect.

The address that a parallel ready sequential call gives its child acts like a sequential return address in that it serves as both the inlet address and the continuation address for the parent. When the child suspends, we need to break this bond between inlet and continuation addresses. In terms of the abstract machine, the parent must have a valid ip, its continuation, and the indirect inlet address for its child must be updated.

Three facts help us in this task. First, disconnection occurs only when the parent is resumed by suspension or by a work-stealing request. Second, the return instruction is an indirect jump, which we perform through a location in the parent frame. Third, as discussed in Section , every parallel ready sequential call has associated with it two inlets: one for normal return and one for return after disconnection.

When the parent disconnects from its child, it must change the indirect inlet address for the child. It does this by changing the return address for the child to its post-disconnection inlet. Since the parent controls the return continuation for the child, we call this mechanism parent-controlled return continuations (pcrcs). Later, when the child completes, it will run the inlet that reflects the fact that it has been disconnected from its parent. We are not done yet, since when the inlet returns we must then continue execution in the parent at its current (at the time of return) ip.
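The pcrc mechanism can be sketched concretely. Below is a minimal C simulation, not the dissertation's generated code: function pointers stand in for return addresses, and the names (frame_t, normal_ret, post_dis_ret, at_join) are illustrative.

```c
/* A sketch of parent-controlled return continuations (pcrcs).
 * A single slot in the parent frame is the child's return address:
 * before disconnection it is both inlet and continuation; on
 * disconnection the parent rewrites it and records a resume address. */
#include <stddef.h>

typedef struct frame frame_t;
typedef void (*inlet_t)(frame_t *, int);

struct frame {
    inlet_t child_ret;    /* the pcrc: inlet + continuation in one slot */
    inlet_t resume;       /* used only after disconnection              */
    int     x;            /* destination of the child's return value    */
    int     join_counter;
    int     reached_join;
};

/* Sequential case: the return address is both inlet and continuation. */
static void normal_ret(frame_t *f, int v) { f->x = v; f->reached_join = 1; }

static void at_join(frame_t *f, int v) { (void)v; f->reached_join = 1; }

/* Post-disconnection inlet: store value, synchronize, then resume. */
static void post_dis_ret(frame_t *f, int v) {
    f->x = v;
    if (--f->join_counter == 0)
        f->resume(f, 0);
}

/* The parent disconnects from its child: swap the pcrc and set a
 * resume address, the two steps described in the text. */
static void disconnect(frame_t *f) {
    f->child_ret    = post_dis_ret;
    f->resume       = at_join;
    f->join_counter = 1;
}

int pcrc_demo(void) {
    frame_t f = { normal_ret, NULL, 0, 0, 0 };
    disconnect(&f);          /* child suspended; parent rewrites pcrc  */
    f.child_ret(&f, 42);     /* child finally returns through the pcrc */
    return f.reached_join && f.x == 42;
}
```

Before disconnection the single slot child_ret serves both roles at once; disconnection splits them by rewriting the slot and recording a separate resume address.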


[Figure panels: source code for two lforks followed by a join; frame layout before disconnection; frame layout after lazy-disconnection; pseudo-code for the implementation.]

Figure: Example implementation of two lforks followed by a join. It shows the state of the thread frame before and after disconnection. It also shows the frames upon execution of the code up to the dashed arrows.

We call the ip at which we continue the resume address. The resume address is stored in a predetermined field in the parent's frame. It is used only when a parent has been disconnected from its child. Thus, when disconnection occurs, two things happen. First, the parent changes the pcrc to reflect the new inlet to be used by the child. Second, the parent stores a continuation that reflects the new state of the parent in its resume address field. The resume address now points to the code following the seed it is activating.

In other words, some action caused the parent to disconnect itself from its child. This action causes the parent to instantiate the nascent thread in the parent. When the child finally does return to the parent, we do not want it executing the lfork of the instantiated nascent thread. Instead, it begins execution at the resume address.

Figure shows an implementation of the lfork operation with its associated pcrc manipulation instructions under lazy-disconnect. If the child created by the lfork of foo runs to completion, it returns to ret, stores the return value to x, and then forks bar. Before disconnection, no synchronization operations are performed. Also, the return address is both the inlet, which stores the return value into x, and the continuation address (the ip in the abstract machine).

After disconnection (the code shown here is for the case of the child suspending), we see that the inlet used by the child has been changed to post_dis_ret, explicit synchronization is performed, and the resumeAdr field in the frame is set to cont.


[Figure panels: source code for a forkset; the sequential code stream; the suspend code stream; the steal code stream.]

Figure: Each forkset is compiled into three code streams. The code fragments shown here do not include indirect inlet address manipulation.

In other words, the return address is now simply an indirect inlet address, and the ip of the abstract semantics is now represented by the resumeAdr field.

A byproduct of lazy-disconnect is that, after disconnection, the parent thread executes all the remaining lforks before the next join differently than if no disconnection had occurred. This is because after disconnection any progress in the parent has to be reflected in its resume address. Thus, the code from the second fork to the matching join in a series of forks has to be duplicated.

Compilation Strategy

The compilation strategy is to create three different code streams for each forkset (see Figure ): sequential, suspend, and steal. The sequential code stream handles the case when all the parallel ready sequential calls run to completion. In this stream each lfork is implemented as a call followed by the appropriate thread seed. The join operation is, of course, not implemented, since no synchronization is needed if none of the calls is ever disconnected from the parent.

The suspend code stream is entered when one of the original lazy threads suspends. In this stream each lfork is also implemented as a call followed by a thread seed. Unlike the sequential stream, synchronization is performed explicitly, and after the lazy thread returns we find the next location to execute by jumping through the resumeAdr field. When this


stream is entered, the suspending call will set up the synchronization counter and also modify its indirect inlet address to point to an explicit inlet, as in Figure . The join is explicitly implemented in this stream.

The steal code stream is entered when a steal request arrives during execution of one of the parent's children. The steal routine is executed, which forks off the next child. This causes all future lforks to be executed in the suspend stream. Since this stream is only entered via a steal request, there is no provision for forked children to return and execute anything other than their inlet. The future children are started when the currently outstanding child finishes and jumps through the resumeAdr field.

A slight optimization involves checking the number of outstanding children before activating a seed. If all the children have completed, then execution continues back in the original sequence of code. If there are outstanding children, then the seed is activated with the post-disconnection code.
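This check can be sketched as follows. It is an illustrative fragment, with stream_t and the outstanding counter as assumed names rather than the actual generated code.

```c
/* Sketch of the seed-activation optimization described above: check
 * the number of outstanding children before activating a seed, and
 * fall back to the cheap sequential stream when all have completed. */

typedef enum { SEQ_STREAM, SUSP_STREAM } stream_t;

typedef struct {
    int outstanding;            /* children not yet returned */
} frame_t;

static stream_t activate_seed(const frame_t *f) {
    if (f->outstanding == 0)
        return SEQ_STREAM;      /* everyone returned: no synchronization */
    return SUSP_STREAM;         /* activate the seed with the
                                   post-disconnection code */
}
```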

Lazy-disconnect

Lazy-disconnect is mainly a way to avoid thread frame copying when threads are disconnected from their parents. However, it also has an effect on the control model, since the frame slot in a child's parent that holds its return address is never moved.

If the parent has been disconnected, its return address field contains the inlet for the first child from which it was disconnected. Any further children that are invoked must have their own return address fields. Thus, each seed activation is assigned a unique field in the frame to store the return address. Even though this field is never changed, we use it so that the child may return with a return instruction. This eliminates the need for a child to determine how it was invoked, making the concrete implementation of the abstract machine simpler than the rewrite rules would seem to indicate. The inlets for such children adjust the stack pointer in the same manner as an explicit seed routine (see Figure ).

Eager-disconnect

[Footnote: The children started in the duplicated code stream are invoked lazily, but their inlets always end with a jump through the resume address; thus we never have to change their return addresses.]
When eager-disconnect is used, we copy the disconnecting child's return address to a new slot so that additional children can use the regular return address. This


means that a parent has to maintain a pointer to its connected child. The parent uses this pointer to change the child's pointer to the return address in the parent slot. Figure shows the configuration before and after disconnection using eager-disconnect with the linked-frames storage model.

The result of eager-disconnect is that the return address for a thread, even one invoked in the suspend code stream, can change locations. Eager-disconnect also increases the cost of the parallel ready sequential call in the linked storage models, since we have to store a pointer from the parent to the child. The pointer is used to update the parent pointer in the child to reflect the new location of the child's return address.
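Under the linked-frames model, this bookkeeping can be sketched as follows. It is a simplified illustration with assumed structure names (parent_t, child_t); the real implementation manipulates return-address slots in activation frames rather than heap structs.

```c
/* A sketch of eager-disconnect's bookkeeping: the parent keeps a
 * pointer to its connected child; on disconnection it copies the
 * child's return-address slot to a fresh slot and updates the
 * child's pointer to that slot. */
#include <stddef.h>

typedef struct child  child_t;
typedef struct parent parent_t;

struct child {
    void **ret_slot;        /* points at the slot in the parent that
                               holds this child's return address */
};

struct parent {
    void    *ret_for_child; /* regular return-address slot            */
    void    *moved_ret;     /* new slot for the disconnected child    */
    child_t *connected;     /* the one currently connected child      */
};

static void eager_disconnect(parent_t *p) {
    /* copy the return address so the regular slot is free again */
    p->moved_ret = p->ret_for_child;
    /* the extra cost: a store through the parent-to-child pointer,
       so the child now returns through the relocated slot */
    p->connected->ret_slot = &p->moved_ret;
    p->connected = NULL;    /* no connected child any more */
}
```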

Synchronizers

In this section we describe a compiler optimization, the synchronizer transformation, which eliminates synchronization costs for forks and joins by always combining synchronization with return, even for disconnected threads. While there is no synchronization cost when all the forked children of a parent run to completion, we do incur synchronization costs once any one of them causes disconnection. We can minimize this synchronization cost by extending the use of pcrcs with judicious code duplication and by exploiting the flexibility of the indirect jump in the return instruction.

A synchronizer is an inlet which implicitly performs synchronization. The idea is to change the return address of a child to reflect the synchronization state of the parent. In this way neither extra synchronization variables nor tests are needed. The amount of code duplicated is small, since we need to copy only the inlet code fragments. Here we describe the synchronizer transformation for two forks followed by a join.

In the previous section we showed how to encode the inlet address and continuation address in the return address for the parallel ready sequential call. The return address for a child was changed at most once, by the parent, if the child suspended. Here we extend the encoding to include the synchronization state of the parent. If none of the children suspends, then each will return, and the parent will call the next child until the last child returns and the parent executes the code following the join.

If any child suspends, then not only must the inlet for that child change to reflect a new continuation address, but each subsequent fork must be supplied a return continuation that indicates that there is an outstanding child. Furthermore, when a child that previously


[Figure panels: main program text for two lforks followed by a join; the inlets which handle return and suspension; the auxiliary inlets which handle synchronization.]

Figure: An example of applying the synchronizer transformation to two forks followed by a join. Each inlet i is represented by two labels: a suspension label i.susp and a return label i.ret. In the pseudo-code we expose the operations that set a child's return address; pcrc(X) denotes the return address for the child X.

suspended returns, it must update the return continuations for each of the outstanding children to reflect the new synchronization state. In the worst case this can require O(n) stores for n forks. The synchronizer transformation also requires O(2ⁿ) synchronizers for n forks. For this reason we apply the synchronizer transformation only when n = 2.

In the case where two forks are followed by a join, the transformation is straightforward and cost-effective. As shown in Figure , the synchronizer transformation generates five inlets for the two forks: the first two (in the middle of the figure) are used when no suspension occurs, and the last three (at the right of the figure) are used when suspension does occur.

There are three cases to consider for the fork of X. First, if it returns without suspending, it returns to an inlet (inlet_X1) that starts the fork of Y. Second and third, if it has suspended, then when it returns either the second call will have been completed or it will still be executing. In the former case its inlet (inlet_X2) should continue with the code after the join. In the latter, it should suspend the parent (inlet_X3).

There are two cases to consider for the second fork, the fork of Y: either X has completed when Y returns, or X's thread is outstanding. If X has completed, then the inlet (inlet_Y1) continues with the code after the join. If X's thread is still outstanding, then the inlet (inlet_Y2) changes the pcrc for X to inlet_X2 and suspends the parent.


When no suspensions occur, inlets inlet_X1 and inlet_Y1 are executed and the runtime cost of synchronization is zero. If the first fork suspends, it starts the second fork with inlet_Y2 and sets its own inlet to inlet_X3. If X (Y) returns next, it sets the inlet for Y (X) to inlet_Y1 (inlet_X2), which allows the last thread that returns to execute the code immediately after the join without performing any tests. In this case the runtime synchronization cost is two stores, which is less than the cost of explicit synchronization. This optimization allows us to use the combined return address of a sequential call even after disconnection.
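The two-fork synchronizer can be simulated with function pointers standing in for return addresses. The inlet numbering follows the figure's five-inlet scheme, but the simulation itself is ours, not the dissertation's generated code.

```c
/* A sketch of the two-fork synchronizer: the parent encodes its
 * synchronization state entirely in the children's return addresses
 * (pcrcs), so the last child to return falls directly into the code
 * after the join with no counter test. */
#include <stddef.h>

struct parent;
typedef void (*inlet_t)(struct parent *);

struct parent {
    inlet_t pcrc_X, pcrc_Y;  /* the children's return addresses */
    int past_join;           /* set when the join is passed     */
};

/* Final inlets: the last returner continues after the join, no test. */
static void inlet_X2(struct parent *p) { p->past_join = 1; }
static void inlet_Y1(struct parent *p) { p->past_join = 1; }

/* Intermediate inlets: rewrite the *other* child's pcrc to its final
 * inlet, i.e. the "two stores" of runtime synchronization cost. */
static void inlet_X3(struct parent *p) { p->pcrc_Y = inlet_Y1; }
static void inlet_Y2(struct parent *p) { p->pcrc_X = inlet_X2; }

/* Both children suspended; install the post-suspension inlets. */
static void setup_after_suspend(struct parent *p) {
    p->pcrc_X = inlet_X3;
    p->pcrc_Y = inlet_Y2;
    p->past_join = 0;
}

/* Drive one interleaving: X returns first, or Y returns first. */
static int run(int x_first) {
    struct parent p;
    setup_after_suspend(&p);
    if (x_first) { p.pcrc_X(&p); p.pcrc_Y(&p); }
    else         { p.pcrc_Y(&p); p.pcrc_X(&p); }
    return p.past_join;
}
```

Whichever child returns first performs one store (rewriting the other child's pcrc); the second returner falls straight into the code after the join without any test.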

Lazy Parent Queue

Before leaving seed activation and examining how we implement LMAM using continuation stealing, we revisit the representations for the parent queue. We have shown that there are advantages to the explicit seed queue, but it imposes a higher cost on threads that run to completion. Here we show how pcrcs allow us to combine the advantages of an explicit queue with the less expensive implicit queue in the same system.

The basic idea is that every seed starts off as an implicit seed. Then, if a work-stealing request arrives or a child suspends to a parent with no nascent threads, the system converts the current implicit seed queue into an explicit one. The resulting explicit queue maintains the same ordering of threads as the implicit queue, i.e., the oldest ancestor's seed is at the bottom of the queue. After this conversion the implicit queue is empty, so the possibility of losing a handle to it if we transfer to another thread does not arise. Further, when a thread that was in the current implicit queue suspends, it obtains work from the explicit queue rather than traversing the tree to find work. In addition, a work-stealing request can easily obtain work from the bottom of the newly created explicit queue.

Figure shows how a seed is converted from an implicit seed into an explicit one. If a work-stealing request arrives, it checks the explicit queue. If it is empty, the system invokes the steal entry point for the seed at the top of the implicit queue, which causes the convert routine to start. The convert routine traverses the implicit queue, converting each thread's seed to its explicit version. When the root of the cactus stack or an already converted thread is encountered, the recursion unwinds and the newly created explicit seeds are stored in the order they would have been stored if explicit seeds had been used from the start. When all have returned, if the explicit queue is not empty, the steal


[Figure panels: seed representations before and after conversion, for a seed representing a nascent thread and for a seed when no nascent thread exists in the thread; the convert_routine, steal_routine, and conv_and_suspend helper routines.]

Figure: The table on the left shows how a seed is represented before and after conversion. The routines that aid in conversion are shown on the right.

entry point is invoked for the seed at the bottom of the queue. Otherwise, it reports that there is no work. Likewise, if a child suspends and its parent has no nascent threads, it invokes the conv_and_suspend routine, which converts the implicit queue into an explicit one and then jumps to the top seed on the queue.

This entire scheme hinges on the use of pcrcs. Before conversion, the return address points to the pre-conversion implicit seed. After conversion, the return address of each converted child points to the converted explicit seed.

Once the conversion routine invokes a parent that has already been converted, we terminate the stack walk. Since we never pull seeds off the middle of the implicit queue, we know that once we have reached a seed that has already been converted, we have converted the entire implicit queue. As shown in Figure , the entire implicit queue is in one branch of the cactus stack.

The lazy parent queue also allows us to plant an explicit seed. The only caveat is that if we plant an explicit seed, we must convert the implicit queue below it into an explicit one.
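The conversion walk can be sketched as a recursion over the chain of parents. This is an illustrative model with assumed names (frame_t, seed_queue_t); the real convert routine walks pcrc return addresses on the cactus stack rather than an explicit parent pointer.

```c
/* A sketch of the lazy parent queue's convert routine: walk up the
 * implicit queue (the chain of parents), recurse first so the oldest
 * ancestor's seed is pushed first, and stop at the root or at an
 * already-converted frame. */

#define MAX_SEEDS 64

typedef struct frame {
    struct frame *parent;   /* implicit-queue link (one cactus branch) */
    int converted;          /* the convert flag from the figure        */
    int seed;               /* stand-in for this frame's nascent thread */
} frame_t;

typedef struct {
    int seeds[MAX_SEEDS];
    int top;
} seed_queue_t;

static void push_seed(seed_queue_t *q, int s) { q->seeds[q->top++] = s; }

/* Recursively convert: ancestors first, so the ordering matches what
 * an explicit queue would have contained from the start. */
static void convert(frame_t *f, seed_queue_t *q) {
    if (f == 0 || f->converted)
        return;             /* root reached, or already converted */
    f->converted = 1;
    convert(f->parent, q);  /* enqueue older seeds first */
    push_seed(q, f->seed);
}
```

Because the recursion descends before pushing, the oldest ancestor's seed lands at the bottom of the explicit queue, and the converted flag makes a second conversion a no-op.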

Costs in the Seed Model

In this section we present the control-related costs of performing the basic operations (thread creation, suspension, etc.) in light of the various options presented in this chapter. Our main metric is the extra cost of an lfork above that of a traditional sequential


call, e.g., a C function call. Since a lazy thread may end up suspending, we also define a secondary cost, the disconnection cost, as the incremental cost of converting a lazy thread call into an independent one rather than creating it as an independent thread ab initio.

There are three parts to the cost of an lfork: creation, linkage, and instantiation. Creation cost is the cost of creating the nascent form of the thread. Linkage is the cost of enqueuing the representation so that it can later be found and instantiated. The instantiation cost is the cost of turning the representation into an executing thread. We also compare lfork to a sequential call and to an eager fork.

As in our discussion of the storage model, we include the costs of lfork, lreturn, suspend, disconnection, and lfork after disconnection. We break down our analysis by the parent queue representation: implicit, explicit, or lazy. We further break down the differences by the kind of disconnection (lazy or eager) used to transform lazy threads into independent ones.

We describe the costs of the operations in terms of m, the cost of a memory instruction; b, the cost of a branch; r, the cost of a register-only instruction; and d, the number of frames traversed before a thread seed is found.

The Implicit Parent Queue

When the parent queue is implicit, the creation and linkage costs of an lfork are always zero, since the seeds are created and enqueued implicitly by calling the child. The overhead for an lreturn is also zero.

The cost of suspension has two components. First, a seed must be located. Second, the parent must be disconnected from its child. Finding a seed requires d indirect jumps, each of which jumps to the suspend routine and loads the proper frame pointer, resulting in a total of d(m + b + r) operations. When a work-stealing request arrives, it too must find a seed and perform a disconnection operation, all of which takes exactly the same amount of time as suspension.

Regardless of the disconnection method used, a synchronization counter must be initialized and a return address must be stored for the child, taking m + r steps. During lazy-disconnect no other steps are performed. In eager-disconnect we also have to change the child's pointer to the return address, incurring an additional cost of m + r.
[Footnote: Loading the frame pointer is either a register or a memory operation, depending on the storage model.]
[Footnote: Recall that we are only counting the control portion of the eager-disconnect operation.]


                             Disconnection model
Operation                    lazy             eager
lfork                        0                0
lreturn                      0                0
suspend                      d(m + b + r)     d(m + b + r)
disconnection                m + r            2(m + r)
lfork after disconnect       2m + 2r + b      2m + 2r + b

Table: The cost of the basic operations using an implicit parent queue. The cost of lfork also includes the cost of running its inlet on return. m is the cost of a memory reference, b is the cost of a branch, r is the cost of a register operation, and d is the number of frames checked before work is found.

lforks after disconnection require setting a resume address and updating a synchronization counter before invoking the child with a parallel ready sequential call, incurring a cost of m + r. On return, the inlet must also update the synchronization state and then finish by jumping through the resume address, incurring a cost of m + r + b. These results are summarized in Table .

The Explicit Parent Queue

When the explicit parent queue is used, an lfork costs more than in the implicit queue because of the linkage cost. Every seed has to be enqueued, which costs m + r, assuming the head of the seed queue is kept in a register. However, suspend and work stealing can find a thread seed in O(1) time. Dequeuing a seed is a stack pop-and-throw-away operation, which takes r instructions. When a lazy thread runs to completion, it has an overhead of m + r for the push and pop operations.

The cost of disconnection remains the same as in the implicit queue model. An lfork after disconnection, however, incurs an additional cost of r for enqueuing the seed. The enqueue operation lacks a memory reference because the parent must already have its seed on the queue if it is executing an lfork after disconnection. Since the seed model requires that all future executions of the parent begin with a jump through the resume address, the seed which is already on the queue is sufficient to represent the remaining nascent threads in the parent. The enqueuing of a seed after disconnection is effectively an un-pop operation.

The costs of the basic operations using the explicit parent queue are summarized in Table .


                             Disconnection model
Operation                    lazy             eager
lfork                        m + r            m + r
lreturn                      0                0
suspend                      m + r + b        m + r + b
disconnection                m + r            2(m + r)
lfork after disconnect       2m + 3r + b      2m + 3r + b

Table: The cost of the basic operations using an explicit parent queue.

                             Disconnection model
Operation                    lazy             eager
lfork                        0                0
lreturn                      0                0
suspend before convert       d(m + b + r)     d(m + b + r)
suspend after convert        m + b + r        m + b + r
disconnection                m + r            2(m + r)
lfork after disconnect       2m + 3r + b      2m + 3r + b

Table: The cost of the basic operations using a lazy parent queue.

The Lazy Parent Queue

The lazy parent queue has the same linkage costs as the implicit queue model, but on average the same suspension costs as the explicit queue model (see Table ). When a thread suspends, it first invokes the conversion routine, which walks down the tree, causing all thread seeds which represent nascent threads to be enqueued on the explicit queue, until it reaches the root of the tree or a seed that has already been converted. Then the original thread executes the suspend by doing an explicit dequeue. The convert operation requires d(m + b + r) operations, which includes checking the convert flag, saving a new seed into the converted parent, executing the calls and returns to descend and ascend the tree, and finally saving the new seed on the queue. After disconnection, the model behaves exactly like the explicit queue model. This is because a disconnected thread has already been converted to the explicit queue.


[Figure panels: a parallel ready sequential call with its return inlet; the associated suspend_routine, which continues the parent in the duplicated code immediately after the call to the child; the associated post-disconnection inlet, which goes directly to the join instead of jumping through a resume address.]

Figure: Example implementation of a parallel ready sequential call with its associated inlets in the continuation-stealing control model.

Continuation Stealing

We now address how to implement the control of the abstract machine LMAM when using continuation stealing instead of seed activation. We use the same basic mechanisms (e.g., a parallel ready sequential call, multiple return entry points, and pcrcs) to implement the continuation-stealing model. The main difference is that the continuation-stealing model lacks representations for nascent threads. Instead, when disconnection occurs, execution continues with the remainder of the parent.

Every series of forks followed by a join is, as in the seed model, duplicated. One series handles the case when no disconnection occurs, and the other handles the forks that occur after disconnection. The only exception to this rule is when a single fork is followed by a large portion of sequential code. In this case we do not duplicate the code, but synchronize explicitly. The reason for this exception is that we see no advantage to duplicating lengthy sections of code to save only four memory operations and a single test.

Control Disconnection

A parallel ready sequential call, as in the seed activation model, is associated with three return entry points: one for stealing, one for suspension of the child, and one for normal return. If a child suspends, it invokes its parent through the suspension entry point. This causes the pcrc to be changed so that when the child returns (i.e., when it completes) it returns to its post-disconnection inlet. It then disconnects the parent from its child and continues execution in the parent.


[Figure panels: a parallel ready sequential call with its return inlet; the associated suspend_routine, which continues the parent in the duplicated code immediately after the call to the child; the associated post-disentanglement inlet, which goes directly to the join instead of jumping through a resume address; the associated steal_routine, which sets the migrated flag and ships the frame to its new processor; the recv_result inlet, to which the frame returns its results when it completes, and which places the frame on the ready queue; and the forward2parent routine, run when the frame is scheduled, which returns the results to the thread's parent.]

Figure: A parallel ready sequential call with its associated steal and suspend routines, helper inlets, and routines for handling thread migration.

Unlike the seed model, however, continuation stealing does not require us to store a resume address in the parent's frame. With continuation stealing the parent continues executing after disconnection, so the continuation address for the parent is actually found in the processor's ip. Instead of resuming the parent at a resume address, the post-disconnection inlet jumps directly to the join. If there are children outstanding, the join fails and the parent suspends, continuing execution in its parent (see Figure ). Note that even if the parent of a suspended child has no nascent thread, it begins execution as a running thread.
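The post-disconnection inlet's direct jump to the join can be sketched as follows. frame_t and its field names are illustrative, and "suspend" is modeled as setting a flag.

```c
/* A sketch of the continuation-stealing post-disconnection inlet:
 * instead of jumping through a stored resume address, it decrements
 * the join counter and falls directly into the join; if children are
 * still outstanding, the parent suspends. */

typedef struct {
    int join_counter;   /* outstanding children  */
    int results;        /* accumulated results   */
    int suspended;      /* parent had to suspend */
    int past_join;      /* join succeeded        */
} frame_t;

/* The join itself: a counter test, no indirect jump needed. */
static void join_point(frame_t *f) {
    if (f->join_counter > 0)
        f->suspended = 1;   /* join fails; suspend to the parent */
    else
        f->past_join = 1;   /* continue after the join */
}

/* post_dis_ret from the figure: store results, synchronize, then
 * go directly to the join. */
static void post_dis_ret(frame_t *f, int v) {
    f->results += v;
    f->join_counter--;
    join_point(f);
}
```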

Migration

In the seed model, once a thread has been started on a processor it remains on that processor; only nascent threads are migrated between processors. Thus a parent always knows if its children are remote and can customize its return inlet to handle remote communication. Since there are no nascent threads in the continuation-stealing model, independent threads must migrate among processors to satisfy a work-stealing request. Aside from the storage-related costs, the main effect of this is that child threads may migrate without the intervention or knowledge of the parent.


When a thread migrates to another processor, the frame that used to hold this thread remains behind. It is used to forward results from its outstanding children to the migrated thread, and to forward the result received from the migrated frame to its parent. To implement this forwarding function we use a combination of pcrcs and a flag.

When a thread is migrated to another processor, it may have outstanding children on the current processor. These children must have their results forwarded to the migrated frame on the new processor. The code fragments suspend_routine and post_dis_ret in Figure show how this is handled. suspend_routine sets the migrated flag to false, indicating that no migration has taken place. The steal_routine sets the migrated flag to true and then migrates the frame. When the child returns, it uses the post_dis_ret inlet. This inlet first checks the migrated flag. If it is true, it sends the data to the new frame. Otherwise, it behaves normally. When the thread itself completes, it sends a message with the data back to its original frame, which causes recv_results to execute. recv_results stores the results and enqueues the frame on the ready queue. When the frame is run, it returns to its parent as if it were completing normally. Thus a parent does not have to notify its children when it migrates.
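The forwarding protocol can be modeled with a flag and a mailbox standing in for the network. The names (steal_routine, post_dis_ret, mailbox) follow the figure, but the simulation itself is ours.

```c
/* A sketch of result forwarding after migration: the stay-behind frame
 * keeps a migrated flag; its post-disconnection inlet either handles a
 * child's result locally or forwards it to the frame's new processor. */

enum { HOME = 0, AWAY = 1 };

static int mailbox[2];          /* per-processor result mailbox */

typedef struct {
    int migrated;               /* set by the steal_routine    */
    int where;                  /* processor holding the frame */
    int local_result;
} frame_t;

/* steal_routine: mark the frame migrated and ship it away. */
static void steal_routine(frame_t *f) {
    f->migrated = 1;
    f->where = AWAY;
}

/* post_dis_ret: forward to the new location if migrated,
 * otherwise behave like a normal inlet. */
static void post_dis_ret(frame_t *f, int v) {
    if (f->migrated)
        mailbox[f->where] = v;  /* send results to the new location */
    else
        f->local_result = v;
}
```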

Integrating the Control and Storage Models

We have now described all the control mechanisms used to implement LMAM. In this section we relate them to the storage models presented in the previous chapter.

Linked Frames

Since the linked-frames mapping uses only explicit links, it imposes no special needs on the control model.

Stacks and Stacklets

Both stacks and stacklets have implicit and explicit links between activation frames, which require us to use stubs to construct the global cactus stack. Several stub routines are required by both control models. The two most important stubs assist in returning control and data from the child at the bottom of the frame area to its parent. When the child is local, it uses the local-thread stub, which simply finds the parent frame and


[Figure: the local-thread stub frame with its saved parentSp, parentTop, and retAdr fields, and its steal and suspend routines, each of which restores sp (and, with stacklets, top) from the stub frame and jumps through the saved retAdr to the matching entry point.]

Figure: A local-thread stub that can be used for either stacks or stacklets.

invokes the proper inlet in the parent. When the child is on a different processor from its parent, it uses the remote-thread stub, which sends the data to the parent's processor.
The local-thread stub's return address is similar to the implicit seed planted by a parent with no nascent threads in it. In other words, it returns control to the proper return entry point in the parent, depending on which return entry point the child uses to enter the stub (see Figure ).

A remote-thread stub's return address is similar to a converted implicit seed planted by a parent with no nascent threads. Since the stub has no local parent, it transfers control to the general scheduler if its child suspends. Unlike the local-thread stub, the remote-thread stub is a custom routine created by the compiler for each codeblock. This is because it must both marshal the arguments from the parent and send the results back using active messages customized for the routine in question.

Both kinds of stubs indicate that the thread at the bottom of the frame area is an
independent thread. Thus, if we are using a lazy parent queue, the conversion routine will
not proceed past the stub.

Spaghetti Stacks

A spaghetti stack presents special implementation challenges because of the need
to limit the cost of reclaiming space on the stack. Since we do not have a garbage collector,
we must reclaim the space as part of executing the program. Here we show how pcrcs can


[Figure: code fragments and a stack diagram. The fragments show (a) a parallel-ready sequential call with suspension, (b) a local fork operation, which reserves space for and fills in the stub and, upon return, does not reset top as an lfork does, (c) the function prologue and epilogue, (d) return_stub, the stub used by frames before their first suspension, and (e) suspend_stub, the stub used by frames that have suspended, which coalesces frames. The stack diagram shows free space, the child frame, the stub, the parent frame, and other frames after the child has executed a suspend operation.]

Figure: Code fragments used to implement a minimal-overhead spaghetti stack. On
the left is the invocation routine for lfork and the function prologue and epilogue. The stack
in the middle shows the result of a parent forking a child. The routines to support suspend
and subsequent forks are on the right.

lower the reclamation cost to that of a stack when children run to completion, and we show
how to keep the cost low when they suspend.

The scheduling policy of LMAM on a distributed-memory machine implies that
when a child runs to completion it will be at the top of the spaghetti stack. Furthermore,
a parallel-ready sequential call is always made from a thread that is at the top of the stack.
Thus there is no need to store a link between a child and its parent if the child is invoked
by a parallel-ready sequential call. The figure above shows code for a parallel-ready sequential
call that takes advantage of this invariant. For children that run to completion, the only
additional overhead is the use of a temporary register, psp, which is needed by the coalesce
routines to calculate the size of the frame. The size of a frame is sp - psp at the time of
return.

When a thread suspends, top must be set, since sp no longer points to the top of
the spaghetti stack. This setting of top occurs in the parent's suspend routine, since we
know, because of pcrcs, that the suspend routine is executed only on the first suspension
of a child. Once a child has suspended, it can no longer reclaim the space for its frame on
exit. However, since the function epilogue does not touch top, it will not reclaim the space


in any case. Instead, reclamation is handled either by the post-disconnection inlet for the
child or upon the return of the frame above it, which will have a stub since it must have
been started with a fork.

A child invoked by a fork needs a link to its parent. This is provided in a stub
which is created by the parent as part of the fork operation (see the routine at the right of
the figure above). The stub for such a routine is initially set to return_stub, which reclaims the
space for the child upon exit. If, however, the child suspends, it changes the stub routine
to suspend_stub, which coalesces the frames and frees them if they are at the top of the
stack.

In short, by relying on the invariants created by our representation of the abstract
parent queue and the implementation of the lfork, fork, and suspend operations, we are
able to reduce the cost of using a spaghetti stack to within one register move more than
that of a normal stack for a parallel-ready sequential call.
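The reclamation discipline described above can be condensed into a toy model (our own illustration; names and sizes are invented): sp marks the current frame boundary, top is the high-water mark protecting suspended frames, and psp is the one extra register a parallel-ready sequential call spends.

```c
#include <assert.h>

static int sp = 0;   /* current top of the spaghetti stack (in words) */
static int top = 0;  /* high-water mark protecting suspended frames */

/* A parallel-ready sequential call in miniature: record the parent's
 * sp in psp, push the child frame, and on return reclaim the frame's
 * (sp - psp) words simply by restoring sp.  If the child suspended,
 * the suspend routine raised top, so the epilogue leaves the frame's
 * space alone even though sp is restored. */
void call_child(int frame_words, int suspends)
{
    int psp = sp;          /* the one extra register move */
    sp += frame_words;     /* child frame occupies psp..sp */
    if (suspends)
        top = sp;          /* first suspension: protect the frame */
    sp = psp;              /* epilogue: pop the frame boundary */
}

int live_words(void) { return top > sp ? top : sp; }
```

A child that runs to completion costs nothing beyond the psp move; a suspended child leaves its words alive above top, to be coalesced later.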

Discussion

This chapter has introduced the mechanisms and compilation strategy needed to
implement the control necessary to realize the lazy multithreaded abstract machine
presented in Chapter . The four main mechanisms are the parallel-ready sequential call, the
pcrc, the realization of the find function, and the realization of the parent queue.
Additionally, we presented a method of eliminating synchronization costs with synchronizers.

The importance of these mechanisms, combined with the compilation strategy of
creating three code streams for each forkset, is the convergence of the implementations
independent of the policy decisions (i.e., thread seeds vs. continuations, eager-disconnect
vs. lazy-disconnect, implicit vs. explicit vs. lazy queue). The three code streams, combined
with the above mechanisms, allow for the elimination of all synchronization costs whether
thread seeds or continuations are used to represent the remaining work in a parent of a
lazy thread. Furthermore, all bookkeeping operations can be avoided when lazy threads run
to completion if an implicit or lazy queue is used to queue thread seeds or continuations.
In fact, we find that the policy decisions have less impact than would appear to be the case
from previous literature, because we have used the appropriate implementation techniques.

Chapter

Programming Languages

In this chapter we describe two very different parallel programming languages for
which we have implemented compilers that use the lazy-threads model described in this
dissertation: Split-C+threads and Id. Split-C+threads, developed by the author, is an
explicitly parallel language based on Split-C, itself based on the imperative language C.
Id is an implicitly parallel functional language based on the dataflow
model of computation.

Split-C+threads

Split-C+threads is a multithreaded extension to Split-C. Before introducing the
thread extensions in Split-C+threads, we briefly describe the features of Split-C that affect
our extension and the compilation of Split-C+threads.

Split-C is a parallel extension of C intended for high-performance programming
on distributed-memory multiprocessors. Split-C, like C, has a clear cost model that allows
the program to be optimized in the presence of the communication overhead, latency, and
bandwidth inherent in distributed-memory multiprocessors. It follows an SPMD model,
starting a program on each of the processors at the same point in a common code image.

Split-C provides a global address space and allows any processor to access an object
anywhere in that space, but each processor owns a portion of the global address space. The
portion of the global address space that a processor owns is its local region. Each processor's
local region contains the processor's entire stack and a portion of the global heap.

CHAPTER PROGRAMMING LANGUAGES

Two pointer types are provided, reflecting the cost difference between local and
global accesses. Global pointers reference the entire address space, while standard pointers
reference only the portion owned by the accessing processor.

Split-C includes a split-phase assignment operator, :=. Split-phase assignment
allows computation and communication to overlap. The operator initiates the assignment
but does not wait for it to complete. The sync statement waits for all outstanding split-phase
assignments to complete. Thus the request to get a value from a location, or put a
value into a location, is separated from the completion of the operation.
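The split-phase discipline can be simulated in plain C (our own sketch; the names are illustrative and are not Split-C's runtime API): a get is initiated and recorded as outstanding, the processor keeps computing, and sync may proceed only once every transfer has landed.

```c
#include <assert.h>

/* Simulation of split-phase gets: initiation records the transfer,
 * delivery happens later, and sync checks the outstanding count. */
typedef struct { int *dst; int val; } pending_t;

static pending_t pending[16];
static int outstanding = 0;

void split_get(int *dst, const int *remote_src)
{
    /* initiate: remember the transfer, but do not deliver it yet */
    pending[outstanding].dst = dst;
    pending[outstanding].val = *remote_src;
    outstanding++;
}

void network_poll(void)        /* deliveries arrive asynchronously */
{
    while (outstanding > 0) {
        pending_t p = pending[--outstanding];
        *p.dst = p.val;
    }
}

int sync_ready(void) { return outstanding == 0; }
```

The gap between `split_get` and `network_poll` is exactly the window in which computation overlaps communication.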

The last attribute of Split-C that we are concerned with is the signaling store,
denoted by :-. A store updates a global location but does not provide any acknowledgement
of its completion to the issuing processor. Completion of a collection of such stores is
detected globally using all_store_sync, executed by all processors, or locally by store_sync.

Split-C+threads extends Split-C by allowing a programmer to move between the
SPMD model and a multithreaded programming model. When a Split-C+threads program
starts, it begins in the SPMD mode of Split-C. However, the programmer may at any time
initiate multithreading with the all_start_threads construct. When all of the threads
have terminated, the program does an implicit barrier among all the processors and again
reenters the SPMD model of Split-C.

We begin our specification of Split-C+threads with a simple example and then
proceed to describe the syntax and semantics of the new operators and how they interact
with the Split-C global address space memory model.

Example Split-C+threads Program

In this section we present an example Split-C+threads program for naively
computing the nth Fibonacci number (see the figure below). This example shows how to start a
multithreaded segment in a Split-C program, how to declare a function that can become an
independent thread, and how to invoke these threads in parallel.

The program starts execution with the invocation of splitc_threads_main,
which is the main entry point for all Split-C+threads programs. All processors
assign x the value of the first argument. Then they all reach the all_start_threads
call, which indicates the beginning of a multithreading portion of the program. There is


    #include <threads.h>

    forkable int fib(int n)
    {
        int i;
        int j;

        if (n < 2)
            return n;
        forkset {
            pcall i = fib(n-1);
            pcall j = fib(n-2);
        }
        return i+j;
    }

    int
    splitc_threads_main(int argc, char **argv)
    {
        int x;
        int r;

        x = atoi(argv[1]);
        all_start_threads(0) r = fib(x);
        on_proc(0)
            printf("fib of %d = %d\n", x, r);
        return 0;
    }

Figure: An implementation of the Fibonacci function in Split-C+threads.

an implicit barrier before and after the all_start_threads statement. The argument to
the all_start_threads statement causes only processor 0 to start executing the fib call.

Instead of executing the call in the all_start_threads statement, the rest of the
processors enter the main thread-scheduling loop, which schedules ready threads. If there
are no ready threads, then a processor attempts to steal a thread from the parent queue of another
processor. The default behavior is to attempt to get work from a random processor.

Processor 0 begins by executing a call to the function fib. The function
is given the type modifier forkable to indicate that it can become an independent thread.
If n is greater than the base case, then the function in turn creates two new logical threads with the
pcall statements. The forkset statement introduces a synchronization block which can
contain pcall statements. A pcall statement indicates that the call on the right-hand side
can be executed in parallel. Execution will not continue out of the forkset block until all


    all_start_threads    suspend
    fixed                Thread
    forkable             ThreadId
    forkset              yield
    pcall

Figure: The new keywords defined in Split-C+threads.

    stmt             ::= forkset stmt
                       | pcall pcall-expression
                       | pcall fixed pcall-expression
                       | pcall-expression
                       | fork maybe-proc-expr pcall-expression
                       | all_start_threads maybe-proc-expr pcall-expression
                       | suspend
                       | yield

    pcall-expression ::= lvalue assign-op function-name ( arg-list )
                       | function-name ( arg-list )

    maybe-proc-expr  ::= expr
                       | (empty)

    expr             ::= ThreadId

    type-qualifier   ::= forkable

    type             ::= Thread

Figure: New syntax for the threaded extensions in Split-C+threads.

the pcalls in the forkset complete. In this case there are two pcalls in the forkset. After
the forkset completes (i.e., the two threads are joined with the parent), the thread returns
a value and terminates.


    (A) forkset {
            pcall i = fib(n-1);
            pcall j = fib(n-2);
        }

    (B) forkset {
            for (i = 0; i < n; i++)
                pcall sum += sum(args);
        }

    (C) forkset {
            pcall x = one();
            if (flag) pcall two();
            baz();
            pcall z = three();
        }

Figure: Example uses of forkset.

When the original call to fib made in the all_start_threads statement completes,
all the processors are signaled that the multithreading section has completed, and a
barrier is performed, after which they all continue SPMD-style with the next
statement. Note that the value of the call is only returned on processor 0.

Thread Extensions to Split-C

The main difference between Split-C and Split-C+threads is that Split-C+threads
has an explicit parallel call, or fork, operation: pcall. Scheduling capabilities are provided
by suspend and yield statements and some library calls. In this section we describe the basic
syntax and semantics of these statements. The new keywords defined in Split-C+threads
and the new syntax introduced by it are shown in the two figures above.

Forkset statement

The forkset statement introduces a statement block that can contain any
statement, including pcalls and forks, all of which must complete before control passes out
of the block. The forkset statement can only appear in a function that has been declared
forkable (see the discussion of function types below). The statements are executed in the order in which they appear
in the program text.

The figure above shows some example uses of forkset. In part (A), two threads
created by pcall are joined with their parent before control continues past the forkset
statement. In part (B), n threads are created, one at a time, by each iteration of the
for statement. The enclosing forkset joins all of the n threads with the parent thread. In
part (C), first a thread is started to execute one, then the if statement is executed,


followed by a sequential call to baz. When baz completes, the last pcall will execute. The
forkset joins all of the threads created before continuing in the parent.

Pcall Statement

The pcall statement initiates a potentially parallel call as a lazy thread. Every
pcall must be in the scope of a forkset statement. The function called must be declared
forkable (see the discussion of function types below).

There are three variants of the pcall statement. The first
creates a lazy thread which may be instantiated locally or stolen by another processor. This
variant of pcall is a concrete representation of an lfork.

In the second variant, the keyword fixed is used to indicate that the lazy thread started
cannot be stolen by another processor. The rationale behind a pcall which creates a lazy
thread that cannot be stolen is that it lets programmers initiate threads whose arguments
are pointers to local data. Although the thread cannot be stolen, it can
still suspend or help to hide latency.

The third variant is a sequential call to a forkable function. This call, although run
sequentially, can suspend. If a thread invoked this way suspends, it also causes its
parent to suspend.

Fork statement

The fork statement creates an independent thread. The programmer may specify
the processor on which the thread is created or leave it to the runtime system to place the
thread. In all cases, the new thread is created eagerly.

fork can be used to create downward threads (by placing it in a forkset), upward
threads, or daemon threads. In the latter two cases, the programmer has to construct her
own synchronization mechanism from the Split-C+threads primitives. The programmer
must be careful to detect the termination of the child. Otherwise, all_start_threads may
complete before the child thread itself terminates (see the next section).

Start Statement

When the all_start_threads statement is invoked, an implicit
barrier is performed, and then the program begins executing in multithreading mode. Before
this statement, and after control returns to the statement following it, the program is in
SPMD mode. If a processor is specified by this statement, then only the specified processor
receives an active thread; the other processors enter the scheduling loop and attempt to
steal work. If no processor is specified, then all the processors invoke the code in the
pcall-expression.

The use of all_start_threads allows a programmer to construct mixed-mode
programs. The program can switch back and forth between SPMD and multithreaded modes of
execution. Since all_start_threads allows every processor to start a function, some affinity
between data and threads can be created by starting functions on local data.

The multithreaded portion of the program started by all_start_threads comes
to an end when the function it calls completes. Control passes to the next statement in
the program after all the processors have completed their work and executed a barrier. If
only a single processor was started (i.e., an argument was specified to all_start_threads),
then it notifies the other processors that each of them should exit its scheduling loop and
perform a barrier. If all of the processors were started, then each enters a barrier when the
function it invoked finishes.

Since the only built-in synchronization is the join mechanism in the forkset
statement, the programmer must include a check that threads started with fork complete
before returning from the function invoked by all_start_threads. In other words,
all_start_threads does not itself check that threads created by fork terminate
before it completes.

Susp end Statement

The suspend statement causes the current thread to release the
processor and become idle. The next thread run is from the parent queue, or, if no threads
are in the parent queue, it is a thread from the ready queue. Which ready thread is run on
the processor depends on the scheduling policy and the control model of the compiler. The
default scheduling policy is for the last thread enqueued on the ready queue to be the next
one scheduled. If both the parent and ready queues are empty, the processor will initiate a
work-stealing request.


Yield Statement

The yield statement causes the current thread to release the
processor and be placed on the ready queue. If there is a nascent thread on the parent queue,
it is run next. If the parent queue is empty, then a thread from the ready queue is run. If
no other threads are available, then the thread executing the yield continues to execute.

Function Typ es

In order to distinguish between functions that can form threads and those that
are guaranteed to run to completion, we introduce a new type modifier, forkable. It is only
used to qualify function types. A function that has type modifier forkable may be used in
a pcall expression.

If a function is not forkable (i.e., it is sequential), then it may not use any of the
extensions presented here except the start statement. It is a runtime error
for a sequential function that has a forkable ancestor on the stack to execute the start
statement. A forkable function may not use barrier or any expressions involving signaling
stores.

A forkable function that takes local pointers as arguments may only be invoked
by the fixed variant of pcall or by a sequential call. This ensures that a thread with
local pointers does not start on a processor other than the one on which those pointers are
valid.

Manipulating Threads

The programmer can also construct her own synchronization primitives. ThreadId
always evaluates to a pointer to the current thread; it has type Thread. The function
enq_thread(x) will enqueue thread x on the ready queue. The current implementation of
Split-C+threads does not allow the programmer to declare any variables of type Thread when
eager-disconnect is in use with the stacklet memory model. The reason for this limitation is
that when threads are disconnected, the activation frame of a thread may be moved, which
in turn changes the thread's identifier. In other words, ThreadId may not remain the same
over the life of the thread for eager-disconnect under stacklets.


SplitPhase Memory Op erations and Threads

One of the main advantages of Split-C is that it allows the programmer to hide the
latency of remote references with split-phase assignments. Such an assignment is initiated
with one operation (e.g., := or :-) and concludes with another (e.g., sync or all_store_sync).
When there is at most one thread per processor this works quite well. However, when there
is more than one thread per processor, we must examine split-phase assignment more closely.

all_store_sync must be abandoned in this case. The basis for the operation of
these signaling stores is that all the threads in the system are participating in the operation.
At the very least, it requires the ability for all the threads in the system to execute a barrier,
which allows the synchronization counters to be reset. Then all the threads in the system
perform some operations, including the store operations, and finally they all wait until all
the stores have completed.

With the split-phase assignment operators there are two possible semantics. When
the sync operation occurs and the outstanding assignments have not yet completed, the
processor can either spin or suspend. If it spins, then it will wait for the operations to complete
before proceeding in the current thread. This closely matches the Split-C semantics, and
threads will not interfere with each other. If, on the other hand, the processor suspends,
which allows the latency of the remote operation to be hidden by running another thread,
then each thread must have its own synchronization counter. If threads did not have unique
counters, then an outstanding operation from one thread could cause another to wait even
though its own operation had completed. In practice, the latency is low enough that this only
occurs if the second thread initiates a split-phase assignment that is local to the processor
on which it executes.

We choose to have a processor suspend if a sync is executed when there are
outstanding stores. Additionally, we choose to have a unique counter per thread for split-phase
operations. store_sync can still be used.
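The need for per-thread counters can be made concrete with a small sketch (our own simplification, not the runtime's data structures): with one counter per thread, a thread whose operations have all landed can pass its sync even while a sibling's remote operation is still in flight.

```c
#include <assert.h>

/* Per-thread split-phase synchronization counter: a thread may pass
 * sync only when its own outstanding count is zero, independent of
 * any other thread's in-flight operations. */
typedef struct { int outstanding; } thread_sync_t;

void initiate(thread_sync_t *t) { t->outstanding++; }
void complete(thread_sync_t *t) { t->outstanding--; }
int  can_sync(const thread_sync_t *t) { return t->outstanding == 0; }
```

With a single shared counter, the assertion for thread B below would fail: A's unfinished remote get would hold B hostage even though B's own operation had completed.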


Other Examples

Here we present two additional examples of Split-C+threads programs. The first
shows how lazy threads can affect the style of programming. The second points out a
deficiency in Split-C+threads.

Lazy Thread Style

Lazy threads reduce the overhead of the thread call but do not, in and of
themselves, eliminate parallel overhead, even if each thread runs to completion. We define parallel
overhead as the overhead introduced into an application which enables it to run in parallel.
As we shall now see, there are ways to reduce parallel overhead.

In this example we show how to implement the nqueens program without parallel
overhead. The nqueens program finds all the placements of n queens on an n-by-n chessboard.
It recursively places a queen in a row and then forks off a thread to evaluate the placements
in the remaining portion of the chessboard. The key function in a sequential version is
below.

    int
    try_row(int *path, int n, int i)
    {
        int j, sum;

        if (i == n) return 1;        /* one successful placement */
        sum = 0;
        for (j = 0; j < n; j++) {
            if (safe(path, i, j)) {  /* can we put a queen in square (i,j)? */
                path[i] = j;
                sum += try_row(path, n, i+1);
            }
        }
        return sum;
    }

This loop tries to place a queen in every row j for the column i that this thread
is responsible for. path[i] is the row number of the queen in column i; safe(path, i, j)
checks to see if a queen may be placed in square (i,j) given the previous placements in path.
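To make the sequential kernel self-contained, here is a runnable version; the body of safe is our own straightforward implementation (the dissertation does not show it), rejecting a square that shares a row or a diagonal with any earlier column.

```c
#include <assert.h>

/* Our own safe(): square (i,j) is usable iff no earlier column k has
 * its queen in the same row or on the same diagonal. */
static int safe(const int *path, int i, int j)
{
    for (int k = 0; k < i; k++)
        if (path[k] == j ||
            path[k] - k == j - i ||   /* same "/" diagonal */
            path[k] + k == j + i)     /* same "\" diagonal */
            return 0;
    return 1;
}

int try_row(int *path, int n, int i)
{
    if (i == n) return 1;             /* one successful placement */
    int sum = 0;
    for (int j = 0; j < n; j++)
        if (safe(path, i, j)) {       /* queen in square (i,j)? */
            path[i] = j;
            sum += try_row(path, n, i + 1);
        }
    return sum;
}
```

Calling try_row(path, n, 0) counts all solutions for an n-by-n board.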

This code cannot be turned into a correct parallel program simply by making the
recursive call into a pcall. If that were the only change, then different threads would
overwrite inconsistent values into path. Each thread needs its own logical copy of path.


Furthermore, we would also need to make path a global pointer so that the try_row thread
could be stolen by another processor.

The correct, but costly, approach is to make a copy of path before every pcall.
We also make path a global pointer, and, if path points to a remote processor, then we
use bulk_get to copy the data from the remote processor onto the processor running the
function. However, this is intolerably inefficient, particularly since we know that most of the
pcalls will run to completion on the processor that invoked them.

Instead, at the start of the function, we test to see if MYPROC (the processor on which
the thread is running) is the same as the parent's processor. If they are the same, no copy
is made. If they are not, we need to get the data from the parent anyway, so we make a new
copy.

    forkable int
    try_row(int *global path, int n, int i)
    {
        int j, sum;
        int *lpath;

        if (i == n) return 1;
        if (toproc(path) != MYPROC) {   /* stolen thread: copy path */
            int *newpath = (int *)malloc(n * sizeof(*path));
            bulk_get(newpath, path, i * sizeof(*path));
            lpath = newpath;
        } else {
            lpath = (int *)path;
        }
        sum = 0;
        forkset {
            for (j = 0; j < n; j++) {
                if (safe(lpath, i, j)) {
                    lpath[i] = j;
                    pcall sum += try_row((int *global)lpath, n, i+1);
                }
            }
        }
        return sum;
    }

We have reduced the parallel overhead by copying the shared data only when an
independent thread is created by a remote fork operation. However, this example depends on
the fact that neither try_row nor safe ever suspends. If either suspends, then an independent
thread will be started on the same processor as its parent and the copy will never be
performed. In fact, a thread can suspend when a child is stolen and does not return by the
time the join is reached: the thread with the outstanding child will suspend to its parent.
Since threads can suspend, the code above will not work unless the compiler inserts a copy
operation in the suspend routine.

The general strategy is for any disconnection to cause a copy of the local data to
be made. We call this strategy copy-on-fork. In the future we plan to explore language
extensions and compiler analysis that will make this transformation automatic, even in the
case of suspension.

I-Structures and Strands

I-structures, first introduced in Id, are write-once data structures useful in a
variety of parallel programs. They provide synchronization between the producer and
the consumer on an element-by-element basis. The two operations defined on I-structures
are ifetch and istore, which respectively get and store a value to an I-structure element. If
an ifetch attempts to get the value of an empty element, the ifetch defers until an istore
stores a value into the element.

We can implement I-structures using the suspension mechanism in Split-C+threads,
as shown in the figure below. When an ifetch attempts to read the value of an empty element, the
ifetch thread is enqueued on the deferred list of that element. Later, when an istore stores
a value into that element, it enqueues all the threads on the deferred list onto the ready
queue. ifetch must be a forkable function, since it can suspend; istore is not a forkable
function, since it cannot suspend and does not perform any pcalls. One nice thing about this
implementation is that it uses the ifetch activation frame to construct the list of deferred
elements.
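The deferred-list protocol can also be exercised as a single-threaded C simulation (our own sketch, with invented names): instead of suspending a thread, a deferred read records where the value should eventually be delivered, and istore releases every waiting reader.

```c
#include <assert.h>
#include <stddef.h>

typedef struct defer { int *dst; struct defer *next; } defer_t;

typedef struct {
    int value;
    int full;              /* nonzero when full, zero when empty */
    defer_t *deferred;     /* readers waiting for the store */
} element_t;

/* ifetch analog: deliver the value now, or queue the reader.  The
 * caller supplies the list node, mirroring how the real ifetch uses
 * its own activation frame to build the deferred list. */
void sim_ifetch(element_t *e, int *dst, defer_t *node)
{
    if (e->full) { *dst = e->value; return; }
    node->dst = dst;
    node->next = e->deferred;
    e->deferred = node;    /* defer: analogous to suspending */
}

/* istore analog: fill the element and release every deferred reader */
void sim_istore(element_t *e, int v)
{
    e->value = v;
    e->full = 1;
    for (defer_t *d = e->deferred; d != NULL; d = d->next)
        *d->dst = v;
    e->deferred = NULL;
}
```

A reader arriving after the store completes immediately; readers arriving before it are all satisfied by the single store, which is the write-once discipline in miniature.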

Although I-structures are easily implemented in Split-C+threads, their full power
cannot be exploited. For example, in the code fragment below, we would like to be able to
execute the second set of ifetches even if one of the first two suspends.

    x  = ifetch(A, i);
    y1 = ifetch(B, i);
    z1 = x + y1;
    y2 = ifetch(B, i+1);    /* want to continue here */
    z2 = x + y2;
    z3 = y1 + y2;

We can arrange for such behavior by invoking each ifetch with a pcall and putting
the entire sequence in a forkset.


    typedef struct deferlist DeferList;
    struct deferlist {              /* list of readers waiting for a store */
        Thread t;
        DeferList *next;
    };

    typedef struct {                /* the structure of an I-structure element */
        int value;
        int flag;                   /* TRUE when full, FALSE when empty */
        DeferList *deferred;
    } Element, *Istructure;

    forkable int                    /* return the value of i[index] */
    ifetch(Istructure i, int index)
    {
        DeferList d;

        if (i[index].flag) return i[index].value;
        /* the element is empty, so defer */
        d.t = ThreadId;
        d.next = i[index].deferred;
        i[index].deferred = &d;
        suspend;
        return i[index].value;
    }

    void istore(Istructure i, int index, int value)
    {
        DeferList *d;
        DeferList *dnext;

        if (i[index].flag) abort();
        i[index].value = value;
        i[index].flag = TRUE;
        for (d = i[index].deferred; d != NULL; d = dnext) {
            dnext = d->next;
            enq(d->t);
        }
    }

Figure: Implementation of I-structures in Split-C+threads using suspend. When allocated,
the elements of an I-structure are initialized to have their flags set to FALSE and the
deferred field set to NULL.


    forkset {
        pcall x  = ifetch(A, i);
        pcall y1 = ifetch(B, i);
        pcall y2 = ifetch(B, i+1);
    }
    z1 = x + y1;
    z2 = x + y2;
    z3 = y1 + y2;

Although this works for the simple example above, it requires that all three fetches
complete before we can begin computing, even though we could start computing when any
two of x, y1, or y2 were returned. If the function to compute z1, z2, or z3 were more complex,
this could be a significant penalty.

A solution to this problem, borrowed from TAM, is to allow for logically
parallel execution within a function. We extend Split-C+threads to allow a function to
have multiple strands. A strand is simply a flow of control within a thread that cannot
suspend and may depend on other strands. Each strand is identified by a unique integer.
Furthermore, a strand may have a condition specified which determines which other strands
must have completed before it is allowed to run. The syntax of the strand statement follows.

    stmt             ::= strand strand-id strand-condition stmt
    strand-id        ::= integer-constant
                       | (empty)
    strand-condition ::= when ( logical-expr )
                       | (empty)

If strands were in Split-C+threads, we could solve our ifetch problem as follows.

    strand 1 x  = ifetch(A, i);
    strand 2 y1 = ifetch(B, i);
    strand 3 y2 = ifetch(B, i+1);
    strand 4 when (1 && 2) z1 = x + y1;
    strand 5 when (1 && 3) z2 = x + y2;
    strand 6 when (2 && 3) z3 = y1 + y2;

For those familiar with TAM: a strand is slightly more powerful than a TAM thread, because it can
branch, but is otherwise the same as a TAM thread.
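Strand sequencing with when-conditions amounts to dataflow scheduling inside a function. A sketch of such a scheduler (our own illustration, not the compiler's generated code) tracks completed strands in a bitmask and enables a strand once every strand it names has finished.

```c
#include <assert.h>

/* A strand is enabled once every strand in its when-condition has
 * completed; completion is tracked as a bitmask (bit i = strand i). */
typedef struct {
    unsigned needs;          /* bitmask of strands this one waits for */
    void (*body)(void);      /* the strand's code (may be NULL here) */
    int done;
} strand_t;

unsigned run_strands(strand_t *s, int n)
{
    unsigned completed = 0;
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < n; i++)
            if (!s[i].done && (s[i].needs & ~completed) == 0) {
                if (s[i].body) s[i].body();
                s[i].done = 1;
                completed |= 1u << i;
                progress = 1;
            }
    }
    return completed;        /* bit i set iff strand i ran */
}
```

Because a strand cannot suspend, running each enabled strand to completion before re-scanning is sufficient; no strand is ever left half-executed.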


The Split-C+threads Compiler

Split-C+threads is compiled using a modified GCC compiler. The compiler is
based on one which was already modified in two ways: it could compile Split-C, and it
produced abstract syntax trees for the entire function instead of for a single statement,
making the compilation process much simpler.

Our compiler first analyzes the structure of the pcalls to determine what code
needs to be duplicated and what optimizations can be applied to each forkset. The
optimizations we currently perform are centered around eliminating synchronization operations.
After the AST pass is complete, RTL is generated. We added several new RTL expressions,
the three most important being fork, seed, and reload expressions. The fork expression is
used to represent the pcalls. The seed expression is used to construct the thread seeds that
come after a fork. The reload expression indicates which registers need to be loaded when
a thread is resumed. We also added new instruction templates in the machine description
files. Finally, the function prologue and epilogue code is tailored to each storage model.

The synchronization optimization is aimed at removing the necessity of incrementing
a synchronization counter before a pcall is made in the suspend or steal code streams.
During construction of the AST, it is recognized that the pcalls are consecutive, and thus
we do not need to increment a synchronization counter for each pcall in the suspend stream.
Instead, the code executed on entering the suspend code stream sets the synchronization
counter to the number of threads that remain to be completed, i.e., the number of pcalls
left to be executed plus one for the suspending pcall. If the forkset contains conditional or
loop statements, then the synchronization counter is incremented before each pcall in the
suspend stream.

The most interesting challenge in implementing the compiler was to correctly
perform the flow analysis and register allocation in the presence of forks, which can branch
to one of three addresses: return, suspend, and work stealing. Our primary goal was to
introduce no new operations in the sequential portion of the code. In order to achieve this,
we ensure that during dataflow analysis the non-sequential code streams do not influence
the sequential code stream. To guarantee that liveness analysis is correct and minimal for
the non-sequential code streams, we construct phantom flow links between basic blocks to
represent the potential flow of control.

CHAPTER PROGRAMMING LANGUAGES

[Figure: control-flow graph for the sequential and suspend codestreams of a forkable function. The source code is:

    forkable int f(int x) {
      int a, b, c;
      forkset {
        pcall a = one(x-1);
        pcall b = two(x-2);
        pcall c = three(x-3);
      }
      return a+b+c;
    }

The sequential stream runs prolog; call one(x-1); call two(x-2); call three(x-3); return a+b+c; epilog. In the suspend stream, each call has a suspend routine that replaces its seed and lforks the remaining calls (lfork two(x-2), lfork three(x-3)), and an inlet that stores the returned result (into b or c), increments the parent queue, synchronizes, and jumps indirectly through the resume address toward the join.]

Figure: The flow graph for the sequential and suspend codestreams of a forkable function compiled with an explicit parent queue, the stacklet storage model, lazy-disconnect, and thread seeds. The dashed lines indicate phantom control-flow links.


A flavor of the compilation process can be seen by examining the figure, which shows the control-flow graph for a simple function with a forkset consisting of three consecutive forks. One group of blocks shows the sequential codestream and the remaining blocks show the suspend codestream. The steal codestream is omitted so that the example fits on one page, but it follows a structure analogous to the suspend codestream.

The first thing to note is that the three calls are not, as one might expect, in the same basic block. If the calls had truly been sequential, then the control-flow links to the suspend codestream would not exist. However, since each call can potentially transfer control to the suspend codestream, a pcall always ends a basic block. This is an artifact of our desire to reuse the infrastructure of the GCC compiler as much as possible. We represent the possible control transfer out of the sequential codestream by a thread-seed RTL expression, which contains a jump table with three addresses: one for sequential return, which points to the following instruction; one for suspension, which points to the suspend routine for the call in question; and one for steal requests, which points to the steal routine.

The most important ramification of the control-flow links from the sequential stream to the other streams is that liveness analysis causes all the variables used in the other streams to be marked as live at the beginning of the function. This is because GCC does not see how the variables are set, due to the strange control flow out of a call expression. We eliminate this effect by blocking data-flow information from the other codestreams back to the sequential stream. This works for two reasons. First, we know that once another codestream is entered, the sequential stream is abandoned until the join; the dark rectangles in the figure indicate the blocking of the data-flow information. Second, we construct phantom flow links within the other streams to ensure that the minimum set of live variables required at the join point is correctly maintained.

There are two kinds of phantom flow links. The first, indicated by the dashed lines in the figure, is used to indicate the set of possible destinations of an indirect jump. The second, indicated by dotted lines, is used to indicate the set of potential destinations that can be reached during execution.

In GCC, if an indirect jump is encountered in the RTL, then every labeled instruction is considered to be a possible destination. This is unnecessarily conservative, since we know the possible destinations of each of the indirect jumps, which are always jumps through the resume address. Thus we augmented the RTL expression for jump expressions to include a set of possible destinations. This improves code quality by limiting the number of variables considered live.

The second kind of phantom link is included to increase the number of variables considered to be live. Consider how thread seeds operate. Once a function has entered the sequential or steal codestreams, the destination of a jump through a resume address (in an inlet, for example) can be any of the remaining pcalls or the join, depending on which of the outstanding pcalls have been started. In order to make sure that the register allocator correctly saves the live variables and restores them on return, we must ensure that the compiler believes that, at the time of the lfork, the next possible block to be executed can be any of the following calls or the join. For this reason we add phantom links from the lfork blocks to each of the other lforks and to the inlets of the call that created them.

The compiler takes several options that allow us to compile code for the different memory models, the different parent-queue implementations, the different work-stealing models, and the different disconnection methods. In addition, it can produce code for either sequential or parallel machines.

Id

Id is an implicitly parallel language with synchronizing data structures. For the purposes of this dissertation its three salient features are that it is non-strict, it is executed leniently, and it has synchronizing data structures. Non-strictness means that a function can be started before it has received all of its arguments. Lenient execution means that a function is evaluated in parallel with its arguments, each evaluated as much as data dependencies permit. Synchronizing data structures, like I-structures, require significant overhead. These three features combine to require a mechanism like strands.

Id was originally implemented on custom dataflow machines, and later on conventional processors by compiling to abstract machines like TAM and P-RISC. Our Id compiler translates Id into TL0, which is the instruction set for the Threaded Abstract Machine (TAM). TL0 is then translated to C, with extensions to support lazy threads with strands.

Chapter

Empirical Results

One of the goals of this thesis is to compare fairly the ideas behind different multithreaded systems. To achieve this, we evaluate the points in the design space using executables produced by the same compiler and compilation techniques.

We first examine the benefits of integrating thread primitives into the compiler, as opposed to using a thread library. Compiler integration yields at least a twenty-five-fold improvement over thread libraries. We next examine the difference between eager and lazy threads, showing that lazy multithreading is always more efficient than eager threads. The rest of the chapter compares the different points in the design space of lazy multithreading.

We begin our analysis of the design space by showing how the different models perform when running code on a single processor. Even when running on a multiprocessor this is an important case, because when the logical parallelism in the program is unnecessary, the potentially parallel calls will be executed sequentially. A successful lazy threading system should execute such code without loss of efficiency relative to a sequentially compiled program. We examine every point in the design space: the models composed from three storage models (linked frames, spaghetti stacks, and stacklets), two thread representations (continuations and thread seeds), two disconnection methods (lazy-disconnect and eager-disconnect), and three queue mechanisms (implicit, explicit, and lazy). In our comparisons we focus on how the different policy decisions on each axis interact to form a complete system.

Each axis is assessed for its performance on multithreaded code that runs serially, and then on multithreaded code that requires parallelism. We then examine the behavior of lazy threads on a distributed-memory machine, the Berkeley NOW. Our last experiment looks at how lazy threads improve the performance of Id codes. A summary of the effectiveness of these models is shown in the figure at the end of the chapter.

With the exception of the comparison between lazy threads and eager threads, all performance numbers are comparisons between lazy threads as compiled by Split-C/threads and sequential programs compiled by GCC. Thus uniprocessor results are reported as slowdowns and multiprocessor results as speedups. In the comparison between eager and lazy threads, we report speedups for lazy threads over eager threads.

Experimental Setup

To evaluate the different approaches to lazy threading, we ran several microbenchmarks compiled by the Split-C/threads compiler for both uniprocessors and multiprocessors. The microbenchmarks allow us to isolate the behavior of the different components of a lazy threading system. The uniprocessor machines are used to analyze the behavior of the individual primitives. The multiprocessor experiments are used to validate the quality of the complete system in a real parallel environment.

For the uniprocessor experiments we use Sun UltraSparc workstations running Solaris. For our parallel experiments we use the Berkeley NOW system running the GAM active message layer. The Berkeley NOW is a cluster of Sun UltraSparc workstations running Solaris, with Myricom Lanai network interface cards connected by a large Myricom network. The round-trip message time using GAM is on the order of microseconds. Since GAM does not support interrupts, the NIC must be polled to determine if a message has arrived; a poll operation when there is no message costs the equivalent of a short sequence of Sparc register-based instructions.

Compiler Integration

This work differs from previous work on multithreading in that thread operations are integrated into the compiler instead of being library calls. This improves performance, as shown in the table below, at the cost of a slight reduction in portability. The fork-join time is the time it takes for the parent to fork and join on a single thread that does nothing but return. The context-switch time is the time it takes for one thread to suspend and another to begin execution.

                    CThreads (µs)    NewThreads (µs)    Split-C/threads (µs)
    fork-join
    context switch

Table: The performance of basic thread operations made with library calls, using the CThreads and NewThreads packages, versus having them integrated into the Split-C/threads compiler. Times are in microseconds on an UltraSparc.

Performance improves mainly because every call to a library function in a package must handle the general case, whereas Split-C/threads can customize each call to the context in which it appears. For instance, a context switch in a thread with no other statements does not have any live registers at the time of suspension, so a compiler implementation does not have to save any registers. On the other hand, a library call has no knowledge of the context of a call, so it must store every register at the time of suspension. This effect is amplified on the Sparc architecture, since the entire register window set, not just the live registers, must be saved.

When multithreading functionality is integrated into the compiler, every call performs exactly the work required by the operation. This results in a performance increase of at least 25 times over library implementations of multithreading.

Eager Threading

We now compare a compiler-integrated eager thread implementation to a lazy thread implementation within Split-C/threads. We use the microbenchmark grain, shown in the figure below and first used by Mohr, Kranz, and Halstead. The grain benchmark creates 2^depth threads. The leaf threads then perform grainsize instructions. We can vary grainsize to detect when the overhead due to multithreading becomes insignificant. The smaller the grain size is, the better the system performs on fine-grained programs.

The figure shows the speedup of lazy threading relative to eager threading for grain with different grain sizes, as the percentage of leaf threads suspending increases. We see that lazy threading is always more efficient than eager threading, even for large grain sizes and even when all the threads suspend. For small grain sizes we


    #include "threads.h"

    forkable int grain(int depth, int grainsize)
    {
      int i, j;

      if (depth == 0) {
        int k;
        int sum = 0;
        for (k = 0; k < grainsize; k++)
          sum = sum + 1;
        return sum;
      }
      forkset {
        pcall i = grain(depth-1, grainsize);
        pcall j = grain(depth-1, grainsize);
      }
      return i+j;
    }

Figure: The grain microbenchmark.

[Figure: "Performance of Lazy Threading Versus Eager Threading." Speedup of lazy over eager threading (1.0 to 3.5) plotted against the percentage of leaves suspending (0% to 100%), with one curve per grain size: 14, 608, 1208, 2408, 4808, and 9608 Sparc instructions.]

Figure: Improvement shown by lazy multithreading (heap storage model, thread-seed activation, lazy-disconnect, and explicit queue) over eager multithreading for the microbenchmark grain, for different grain sizes (in Sparc instructions) and different percentages of leaf threads suspending and resuming after all leaves have been created.


[Figure: two panels, "Memory Usage" and "Allocation Requests," plotting memory used (bytes, log scale) and allocation requests for eager and lazy multithreading, together with the eager/lazy ratio, against the percentage of leaves suspending (0.001% to 100%).]

Figure: Memory required, and allocation requests not satisfied by the pool manager, to store thread state for eager and lazy multithreading, as a function of the number of leaves suspending.

see the largest improvements, while for large grain sizes the improvements are modest. Even when the lazy threads are disconnected from their parents and have to run independently, as is the case when they suspend, we still see improvements, with the size of the gain depending upon the grain size. In no case is eager multithreading faster than lazy threading. The dips in speedup are an artifact of how the benchmark chooses threads to suspend: at one measured point every parent of a leaf thread suspends, because one of its children always suspends, and at another every grandparent of a leaf thread suspends.

In this and in all the benchmarks, the ready queue is actually a LIFO stack, so the first thread put on the ready queue is the last scheduled. For this benchmark, a leaf suspends after placing itself on the ready queue. Thus the suspended leaves are not scheduled until the last leaf thread is created.

When parallelism is not needed, and thus none of the leaves suspend, the lazy threading system uses half the memory that the eager system uses, since the children are not created until they are run (see the figure). However, as soon as the children begin to suspend, we find, as expected, that the memory requirements are essentially the same in the lazy and the eager systems when heap-allocated frames are used.

When this benchmark is run on a single processor, the lazy system never uses more memory than the eager system. However, there are cases when the eager system uses


    Grain Size    Register Windows    Flat    Slowdown

Table: Comparison of grain compiled by GCC for register windows and for a flat register model. Grain size is in Sparc instructions; times are in microseconds.

less memory than the lazy system, even though both systems schedule threads depth-first. Since the eager threading system always schedules children from the ready queue, suspended frames that are enabled by a remote processor are scheduled before all the lazy children are created. In this benchmark this would result in the enabled children finishing before all the leaves were created, thus reclaiming the memory they used. In all cases the lazy threading system creates all the leaves before any of the enabled leaves on the ready queue are run. Thus, under certain conditions, some programs can consume less memory under an eager threading system.

Comparison to Sequential Code

Perhaps the most important measure of success for any lazy threading system is the speed of a program when all of the potentially parallel calls run serially. If a lazy threading system performs well on this test, then programmers or language implementors can parallelize wherever possible without concern about threading overhead. Further, when all the potentially parallel calls run serially, they could have been written as sequential calls, and we can compare the program to a similar program compiled by a sequential compiler. In this section we compare grain compiled with different options by our lazy thread compiler to grain compiled with GCC.

Register Windows vs. Flat Register Model

Before comparing programs generated by the lazy threads compiler to those generated by GCC, we need to isolate the effects of register windows on the Sparc. The lazy threads compiler does not use register windows, so we make our comparisons against code produced by GCC for the flat register model (the -mflat option of GCC).


    Heap Model        Spaghetti Model    Stacklet Model
    Seed-Copy-Lazy    Seed-Link-Lazy     Cont-Copy-Lazy
    Seed-Copy-Lazy    Seed-Link-Lazy     Seed-Link-Impl
    Cont-Copy-Impl    Seed-Link-Lazy     Cont-Copy-Lazy
    Seed-Link-Impl    Seed-Link-Lazy     Cont-Copy-Lazy
    Seed-Copy-Impl    Seed-Link-Lazy     Cont-Copy-Lazy

Table: The minimum slowdowns relative to GCC-compiled executables for the different memory models, and the points in the design space at which they occur; each row corresponds to one grain size.

As the table above shows, the code produced for the flat register model is less efficient than the register-window model. There are two reasons for this inefficiency: the flat model uses callee save, and its prolog and epilog code requires more memory operations. Since callee save is in effect, more register save and restore operations occur in the flat register model. The flat register model also requires an additional two stores and two loads per function invocation compared with the equivalent code compiled for register windows. In retrospect, we should have optimized the code produced for GCC's flat model before introducing the lazy thread modifications. In the rest of this chapter we compare the lazy thread code to the flat register model.

Overall Performance Comparisons

The figure shows the slowdown of each of the models for different grain sizes. As expected, we see little difference between the models at large grain sizes and large differences at small grain sizes. When all the threads run to completion, as in this case, the two biggest factors are the memory model and the queueing method. The linked-frame memory model is significantly slower than the other memory models, running at half their speed for small grain sizes. The other major factor is the use of the explicit queueing method, which increases overhead independently of the other choices in the design space.

Comparing Memory Models

The most important improvement introduced by lazy threads is the use of stack-like structures to store thread state. As the figure shows, the overhead of the linked-frames model, when averaged over the other twelve points in the design space, is a factor of two to five times that of the other memory models. This occurs for two reasons: frame


[Figure: four panels, "Uni-processor Multithreaded Slowdown" for grain sizes of 8, 68, 608, and 6008 instructions, plotting the slowdown of every design point, grouped into the linked-frame, spaghetti-stack, and stacklet memory models.]

Figure: Slowdown of uniprocessor multithreaded code running serially, relative to GCC code. The four-letter abbreviations specify the memory model (H = linked frame, P = spaghetti, S = stacklets), the thread representation (C = continuations, S = thread seeds), the disconnection model (E = eager, L = lazy), and the queue model (E = explicit, I = implicit, L = lazy). Note the different scales of the Y axis on the four graphs.


[Figure: "Memory Model Average Overhead." Average slowdown (1.0 to 2.2) versus grain size (8, 14, 68, 608, and 6008 instructions) for the linked-frames, spaghetti, and stacklet memory models.]

Figure: Average slowdown of multithreaded code running sequentially on one processor, relative to GCC code. The slowdown is the average of all points in the design space for each memory model.

allocation is more expensive, and the child must perform a memory operation in order to find its parent.

The difference between spaghetti stacks and stacklets is less pronounced. Stacklets are slightly more efficient than spaghetti stacks because the overflow check is less expensive than saving and restoring a link between the child and the parent. However, stacklets may be more expensive than spaghetti stacks if many overflows occur. By adjusting the depth argument to grain, we can cause an overflow to occur for every leaf. A worst-case scenario is when all the allocations occur at the boundary: a function that recursively calls itself and finally calls another function, null, in a loop will take longer if overflows occur when null is invoked, even though null does nothing but return. In other words, the allocation for an overflow doubles the cost of a function call, which is expected, since it is similar to a linked-frame allocation.

To evaluate the effect of the memory model in isolation, we also look at a sequential grain compiled by the lazy thread compiler. The figure shows the slowdown of the multithreaded version of grain relative to GCC, of the sequential version of grain relative to GCC, and of the multithreaded version relative to the sequential version when both are compiled by the lazy threads compiler. The last data line shows the overhead of potentially parallel calls versus sequential calls within the lazy threading framework.


[Figure: three panels, one per memory model (linked-frame, spaghetti stack, stacklets), plotting the slowdowns LT/GCC, LT-Seq/GCC, and LT/LT-Seq for each combination of thread representation (Cont/Seed), disconnection (Eager/Lazy), and queue (Impl/Expl/Lazy).]

Figure: Comparison of the three memory models. LT/GCC shows the slowdown of the multithreaded grain relative to a sequential version compiled using GCC. LT-Seq/GCC compares a sequential version of grain compiled with Split-C/threads relative to GCC. LT/LT-Seq shows the slowdown introduced by turning sequential calls into pcalls when both are compiled by Split-C/threads.


As expected, the linked-frame memory model introduces most of its overhead not in handling potentially parallel calls but simply in managing the activation frames. On the other hand, both the spaghetti-stack and stacklet models show almost no overhead in the sequential version; all the overhead is introduced by the potentially parallel calls.

The interaction between the different axes of the design space is seen in the extra overhead of the explicit-queue models for both the spaghetti-stack and stacklet memory models. In both of these memory models the stack must be managed specially when a thread suspends. When explicit queueing is used, the parent of a suspending thread may not be notified that its child has suspended. Thus, to prepare for the possible suspension of a child, the compiler must initialize some frame slots on every call, resulting in extra overhead.

Having shown that the linked-frame memory model is always significantly slower than the other memory models, we eliminate it from the rest of the comparisons in this section.

Thread Representations

In this section we investigate the effect of the thread representation on performance. We begin by looking at grain running sequentially. We then turn our attention to how the choice of thread representation interacts with the compilation process. This interaction varies with the number of pcalls in a forkset and the data flow in the forkset. Finally, we show the effect of running threads in parallel on grain by varying the number of leaf threads that suspend.

For any choice of memory model, disconnection method, or queueing mechanism, there is no significant difference between thread seeds and continuations when all the potentially parallel calls run to completion (see the figure). This is expected, since we use the same techniques to implement both. When the potentially parallel call runs to completion, the continuation and the seed are both enqueued in the same way, and the pcall is implemented as a sequential call.

When a pcall in a forkset does not run to completion, the different thread representations can perform differently. If threads are represented with continuations, then when a child suspends, the continuation to the rest of the thread is stolen, and when the suspending child returns, it only tests the join to see if it is the last returning child. If it is


[Figure: "Thread Representations (grain size = 8), All Threads Run To Completion." Slowdown relative to GCC (LT/GCC) for continuations and thread seeds across the spaghetti-stack and stacklet models with each disconnection and queue choice.]

Figure: Comparison of continuation stealing and seed activation when all threads run to completion, across the design space, for grain with a grain size of 8 instructions. The slowdown is relative to a GCC-compiled sequential grain.

not, the join fails and the general scheduler is entered. Thus the control flow of the suspend and steal streams is identical to that of the sequential stream. When thread seeds are used to represent nascent threads, children that previously suspended will, when they complete and return to their parents, activate a thread seed if one exists. Thus the control flow in the suspend and steal streams is from every pcall to every other pcall in the forkset. The additional control paths present when thread seeds are used require that every inlet end with an indirect jump. Furthermore, every register that is live before all the pcalls must be restored before the indirect jump is executed.

To evaluate the difference in control flow, we compare three programs whose forksets differ in whether or not results are returned by the children threads. Each program consists of a loop that repeatedly calls a routine, caller, which consists entirely of one forkset with pcalls to another routine, test. The test routine either suspends and later returns, or returns immediately, depending on its argument. In the first program, voidvoid, both caller and test are void functions, and no values need to be restored or saved in any codestream. In the second program, void, test is a void function and caller returns a value, so one value needs to be restored and saved in both models. In the last program, return, both test and caller return values; thus every pcall in the suspend stream has to save and restore all the variables in the thread-seed model, while in the continuation-stealing model the nth call needs to save and restore n registers. As we see in the figure, the


[Figure: two panels, "Comparing Thread Representation," plotting the continuation/thread-seed execution-time ratio for the voidvoid, void, and return programs across the design-space models, when the first pcall out of two (left) and out of ten (right) suspends.]

Figure: The execution-time ratio between the continuation-stealing model and the thread-seed model when the points on the other axes are fixed and the first pcall of the forkset suspends. Lower ratios indicate that continuations perform better than thread seeds.

continuation-stealing model is more efficient for large forksets, while for small forksets there is no real difference between the two models. This data is consistent with the fact that the thread representation does not significantly affect the other axes of the design space. Interestingly, even though more save and restore operations are performed in the last program, return, under the thread-seed model, the difference in performance is smaller because there is slightly more work per thread.

The preceding figures show the worst case for thread seeds, in that the first pcall in the forkset suspends and all the remaining calls must run in the suspend stream. As the next figure shows, if a later pcall suspends, the difference between the two models disappears.

When we look at the behavior of grain in the figure, we see that the thread representation has little effect on the performance of the program, even as the number of threads suspending changes. The small difference is explained by two factors. First, there are only two pcalls in the forkset, so there is no difference in the control flow in the suspend stream. Second, the suspending threads are evenly distributed between the first and last pcalls in the forkset.

As more leaves suspend, continuations perform better than thread seeds by only a small margin. This is particularly true when the memory model has low overhead (e.g.,


[Figure: "Comparing Suspension Location." Continuation/thread-seed ratio across the design-space models when the 1st, 5th, or 10th of ten pcalls suspends.]

Figure: Comparing the effect of which of the ten pcalls in a forkset suspends.

[Figure: "Thread Representations for Grain of 8." Continuation/thread-seed ratio across the design-space models as the percentage of leaves suspending varies from 0.01% to 10%.]

Figure: Comparing thread representations for a grain size of 8 as the percentage of leaves that suspend increases. Missing bars indicate that the program was unable to run, because it ran out of memory or took more than three orders of magnitude longer than the other programs.

stacklets with lazy disconnection) and the implicit queue is used. When implicit queueing is used, the suspending thread always returns to its parent. When continuation stealing is in effect, the parent immediately suspends to its parent if it has no work; with thread seeds, in contrast, registers might be restored needlessly and a jump through the resume address would be performed before the parent suspends.


[Figure: slowdown of eager (LT/GCC-eager) and lazy (LT/GCC-lazy) disconnection for each storage, representation, and queueing model, all threads running to completion.]

Figure: Comparison of eager and lazy disconnection for grain with a grain size of 8 and all the threads running to completion.

Continuations and thread seeds perform equally well when all the threads in a forkset run to completion, or, when a thread does suspend, if there are few remaining pcalls left in the forkset. Since the total number of pcalls in a forkset is often very small (in most cases, two), both methods of representing threads should work equally well when executing multithreaded code on a uniprocessor.

Disconnection

In this section we compare eager-disconnect and lazy-disconnect. We first look at the cases when all threads run to completion in grain, and then when threads suspend in grain. Next we look at how the number of pcalls in a forkset affects the relative performance of the two disconnection methods.

As we see in the figure above, the different disconnection models have little effect on the sequential performance of potentially parallel calls. When eager-disconnect is used, the parent must be able to modify the child's link back to its parent. In both the linked-frame and stacklet models this requires an extra store in the function prolog of every child. There is no difference in the spaghetti model because, when explicit queueing is used, extra overhead is always required.

The real difference between the methods of disconnection shows up when the potentially parallel calls run in parallel. In this case we see an overwhelming disadvantage


[Figure: eager/lazy disconnection ratio by model as the percentage of leaves suspending varies from 0.01% to 10%; bars above 4 are truncated.]

Figure: Comparison of eager and lazy disconnection for a grain size of 8 as the number of leaves suspending changes. Higher bars indicate lazy-disconnect is better.

[Figure: left, eager/lazy disconnection ratio with a forkset of two pcalls; right, continuation/thread-seed ratio with a forkset of ten pcalls; legends distinguish void and return child types.]

Figure: Comparison of eager and lazy disconnection when the first pcall in a forkset with two or ten pcalls suspends.


to the eager-disconnect method for stacklets, since entire activation frames must be copied. The figure above compares the two methods as the number of leaves suspending changes. Eager-disconnect is significantly worse for the stacklet models even when as few as one out of 10,000 threads suspends. When more threads suspend, the programs run for so long (hours as opposed to seconds) that we omit the data. The only other significant difference between disconnection methods appears when spaghetti stacks are used with the explicit queue. In this case the eager-disconnect method requires extra setup on every call, whether or not it suspends.

As the figure above shows, when the forkset has only two pcalls, the overhead of copying the frames overwhelms the advantage of being able to run the next pcall in the same stacklet as the parent. For the other memory models there is no real difference between the two disconnection methods when the forkset has few pcalls.

We conclude that eager-disconnect should not be used with stacklets, or with spaghetti stacks if explicit queueing is used. Further, there is no advantage to eager-disconnect unless the majority of forksets have many pcalls in them.

Queueing

We now examine the behavior of the different queueing mechanisms when all the threads run to completion and when they suspend. As expected, the figure below shows that the explicit queue is more expensive than either the lazy or the implicit queue for threads that run to completion. The difference between the implicit and lazy queues is small when the threads run to completion.

When threads suspend, the explicit model becomes more attractive, and the implicit and lazy models diverge in performance. The figure below shows the improvement of the explicit queue model as more threads suspend. It also shows the destructive interaction between spaghetti stacks and the explicit queue. As discussed above, this arises because there is no guarantee that threads are notified when they must reclaim storage, so every activation frame must set and check some variables in the prolog and epilog.
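A minimal C sketch of that per-frame flag (the names are hypothetical) shows why the check is pure overhead for threads that never suspend:

```c
#include <stdbool.h>

/* Hypothetical spaghetti-stack frame header. With an explicit queue
 * there is no guarantee a suspended thread is told to reclaim its
 * storage, so every frame maintains a flag set in the prolog and
 * tested in the epilog, whether or not the thread ever suspends. */
struct sframe {
    bool must_reclaim;   /* set in the prolog, checked in the epilog */
};

static void prolog(struct sframe *f)     { f->must_reclaim = false; }
static void on_suspend(struct sframe *f) { f->must_reclaim = true;  }

/* Epilog: this test runs on every return. Its per-call cost is what
 * makes spaghetti stacks plus explicit queueing a poor combination. */
static int epilog(struct sframe *f) {
    return f->must_reclaim ? 1 /* free the dead frame */ : 0;
}
```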


[Figure: slowdown versus GCC of the explicit, implicit, and lazy queueing methods for each storage, representation, and disconnection model, all threads running to completion.]

Figure: Comparison of the three queueing methods (explicit, implicit, and lazy) for grain when all threads run to completion.

[Figure: left, explicit/lazy ratio; right, explicit/implicit ratio; both by model as the percentage of leaves suspending varies from 0.01% to 10%.]

Figure: Comparison of the queueing mechanisms as the number of threads suspending increases, for grain with a grain size of 8 instructions. Lower bars indicate that explicit queueing is preferred.


[Figure: speedup of grain 8 on up to 32 processors of the NOW; the spaghetti-with-explicit-queue, linked-frame, and implicit-queue models trail the rest.]

Figure: Speedup of grain (grain size 8 instructions) on the NOW, relative to a version compiled with GCC on a single processor, for the different models.

Running on the NOW

Having analyzed the behavior of the different points in the design space on a uniprocessor, we now investigate the performance of lazy thread programs on a distributed-memory multiprocessor. We limit our attention to those models that can be reasonably supported on a distributed-memory machine. Thus we do not look at continuation stealing or eager-disconnect. Continuation stealing requires migration of instantiated threads, which is costly and very difficult to implement. Eager-disconnect is clearly too costly, even on a single processor, to have any positive benefit on multiprocessors. We look at the implicit queueing method only briefly, because it promotes the distribution of the finest-grained work and thus does not scale well on multiprocessors.

The figure above shows the speedup of grain with a grain size of 8 instructions when the network is polled on every function call. Clearly the work distribution method implemented by the implicit queue is not efficient, as the programs compiled with the implicit queue run very poorly. As expected, the linked-frame storage model performs poorly: independently of the number of processors, it is substantially slower than the other storage models. The other

(Continuation stealing can be implemented without frame migration, but in that case, when a continuation is stolen, the parent must fork off its next child onto the remote processor, just as in seed activation. However, with continuation stealing the parent cannot be continued until its remote child returns, and it will then also have to fork off its next child onto a remote machine.)


[Figure: slowdown over GCC of GCC, GCC-Flat, LT-Sequential, LT-Uniprocessor, LT-Multi-No-Poll, and LT-Multiprocessor at several grain sizes, for the PSLL and SSLL models.]

Figure: A breakdown of the overhead when running on one processor of the NOW. LT-Sequential is the serial version of grain. LT-Uniprocessor is a multithreaded version of grain compiled for one processor. LT-Multi-No-Poll is compiled for an MPP but does not include any network polling. LT-Multiprocessor is an MPP version with polling.

models all perform well, achieving linear speedup relative to themselves. However, the lazy threads programs run four times slower on a single processor of the multiprocessor than on a uniprocessor.

The figure above shows where the performance is lost. Focusing on the smallest grain size, we pay an immediate penalty for using the -mflat option of GCC. Ignoring the linked-frames model, there is no additional penalty for using Split-C+threads to compile a sequential version of grain. However, turning the function calls into pcalls incurs an additional loss of performance. Compiling the multithreaded version for a multiprocessor adds no cost, but adding the instructions to poll the network interface more than triples the running time.

If we compare the performance of lazy threads on the NOW with that of the Thinking Machines CM-5, we see how the less expensive poll operation on the CM-5 boosts performance. The figure below shows the speedup of grain compiled by an earlier compiler, which used stacklets, thread seeds, lazy-disconnect, and explicit queueing. Even the finest grain sizes run at high efficiency. On the NOW, the more expensive poll operation reduces the overall performance of even the large grain sizes.

When we reduce the number of poll operations, efficiency increases dramatically. If we change the poll operation into a conditional poll, which checks the network only every n polls, we increase the efficiency of the program without implicitly increasing the grain


[Figure: speedup on up to 30 processors for granularities of 14, 38, 188, 608, 1208, and 3608 instructions.]

Figure: Speedup of lazy threads (stacklets, thread seeds, lazy disconnection, and explicit queue) on the CM-5, compared to the sequential C implementation, as a function of granularity.

[Figure: speedup as the polling period varies from 1 to 16384, for several processor counts.]

Figure: The effect of reducing the polling frequency on grain with a grain size of 8 instructions, using stacklets, thread seeds, lazy-disconnect, and lazy queueing.


[Figure: four panels showing speedup on up to 32 processors at grain sizes 8, 68, 608, and 6008 for the spaghetti and stacklet models with explicit and lazy queues.]

Figure: Speedup of grain at different grain sizes for the explicit and lazy queue models. In all cases the same polling period is used.

size of the work that can be stolen. The figure above shows how the speedup of grain changes as we increase the period between polls. Performance increases as we decrease the number of polls, until a point is reached when the long periods of ignoring the network interfere with load balancing. For the rest of our experiments we fixed the polling period accordingly.
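A conditional poll of this kind can be sketched in a few lines of C; the countdown scheme and names are assumptions for illustration, not the Split-C+threads implementation:

```c
/* Hypothetical conditional poll: the network is actually checked only
 * once every POLL_PERIOD calls, amortizing the cost of the expensive
 * poll operation on the NOW without changing program structure. */
#define POLL_PERIOD 1024

static int poll_countdown = POLL_PERIOD;
static int real_polls = 0;            /* stand-in for the network check */

static void maybe_poll(void) {
    if (--poll_countdown == 0) {
        poll_countdown = POLL_PERIOD;
        real_polls++;                 /* service the network here */
    }
}
```

The period trades responsiveness for overhead: too small and every call pays for a poll, too large and remote requests (and thus load balancing) wait too long.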

The figure above shows the speedup of grain on the NOW with this polling period. In all cases the combination of spaghetti stacks and the explicit queue performs poorly. The other combinations are roughly equivalent once the grain size is large enough. When run on all the processors, the efficiency ranges from its lowest value at a grain size of 8 instructions


[Figure: speedup on up to 32 processors for the spaghetti and stacklet models with explicit and lazy queues at grain sizes 8, 68, 608, and 6008, with the larger work stolen remotely.]

Figure: Speedup of an unbalanced grain where there is more work in the second pcall of the forkset.

[Figure: speedup on up to 32 processors for the same models and grain sizes, with the larger work kept local.]

Figure: Speedup of an unbalanced grain where there is more work in the first pcall of the forkset.


[Figure: left, speedup of nqueens on up to 32 processors; right, efficiency (roughly 0.64 to 0.76); legend: Copy-On-Suspend versus Always Copy.]

Figure: Speedup and efficiency achieved with and without the copy-on-suspend optimization for nqueens, using stacklets, thread seeds, lazy-disconnect, and the lazy queue.

to much higher values at larger grain sizes, showing that lazy threading is effective, when implemented correctly, even on distributed-memory machines.

Lazy threads also work well when the amount of work performed by the different threads is not uniform. The two figures above show the speedup of a modified grain in which the amount of work performed by a parent's children differs by a factor of two. In one version the first pcall performs twice as much work as the second; thus work that is stolen is always smaller than the work performed locally. In the other, the second pcall performs twice as much work as the first, leading to larger-grained work being stolen. When the grain size is small, we see a slight performance degradation when the work that can be stolen is smaller than the work available locally. While there is only a small difference between these two programs, the difference shows the asymmetry in the overall approach taken by lazy threads.

Using Copy-on-Suspend

In our final experiment using Split-C+threads, we examine the effect of the copy-on-suspend optimization as applied to the nqueens program. As the figure above shows, the copy-on-suspend optimization yields a noticeable improvement in performance. If the copy-on-suspend optimization is not applied to nqueens, there is a loss in performance for using lazy threads even on a single processor, because of the need to copy the shared data. See


the earlier discussion for details. The copy-on-suspend optimization eliminates the need to copy the shared data except when parallelism is actually needed.
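The optimization can be sketched in C (the data structure and names are hypothetical): the copy is made only on the suspension path, so a child that runs to completion reads the shared data in place:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared state for an nqueens-style search. */
struct board { int cells[8]; };

/* Copy-on-suspend, minimal sketch: a child that completes uses the
 * parent's board directly; only a child that actually suspends pays
 * for a private copy, so the common case does no copying at all. */
static struct board *maybe_copy_on_suspend(struct board *shared,
                                           int suspends) {
    if (!suspends)
        return shared;                  /* fast path: no copy */
    struct board *priv = malloc(sizeof *priv);
    memcpy(priv, shared, sizeof *priv); /* pay only when parallel */
    return priv;
}
```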

When running on a distributed multiprocessor there is the additional penalty of sending the data across the network. This imposes two costs: the explicit cost of sending the data, and the implicit cost of having to poll the network often enough to service the data requests. When compiled for a multiprocessor and run on one processor, the program runs at high efficiency, indicating that the cost of polling is relatively low. Since nqueens achieves substantial speedup on the full machine, the poll rate is sufficient to handle both load balancing and requests for data.

Suspend and Steal Stream Entry Points

As discussed earlier, for ease of experimentation we chose to implement the suspend and steal entry points associated with a seed or continuation as jump instructions placed after the call with which they are associated. This has a larger than expected impact on the performance of a pcall, due to the difference in the offset used by a return instruction. If we modify the assembly code by changing the jumps to nops and have the function return like an ordinary sequential call, we see no effect on the runtime, even though six extra instructions are executed on every return. If we then remove the nops, so the same instructions are executed, we see a measurable improvement in performance.
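At the machine level these entry points are reached by returning at different offsets from the saved return address; a portable C analogy (with hypothetical status codes, not the generated assembly) is a multiway branch placed directly after the call:

```c
/* Hypothetical return statuses a child can select. In the real
 * implementation the child returns to different offsets from the
 * saved return address rather than returning a code. */
enum ret { RET_NORMAL = 0, RET_SUSPEND = 1, RET_STEAL = 2 };

static int child_status;               /* set before the child returns */
static enum ret child(void) { return (enum ret)child_status; }

/* The caller's code stream directly after the call: a normal return
 * falls through to the sequential path, while suspension and stealing
 * reach dedicated entry points, mirroring the jump instructions
 * placed after the call instruction. */
static const char *pcall_site(void) {
    switch (child()) {
    case RET_NORMAL:  return "inlined";    /* sequential fall-through */
    case RET_SUSPEND: return "suspend";    /* suspend-stream entry */
    case RET_STEAL:   return "steal";      /* steal-stream entry */
    }
    return "?";
}
```

The offset trick makes the common (normal-return) path free: no status register is tested, and the alternate streams cost nothing until a child actually suspends or work is stolen.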

Comparing Code Size

Our compilation techniques incur additional overhead in the form of an increase in code size. Every call is duplicated for single-processor concurrency, or triplicated for a multiprocessor. If there is code between pcalls, it too must be copied. Furthermore, each call has associated with it the suspend and steal stream entry points, setup code to handle disconnection, and so on. The figure below compares the size of an object file for the grain function compiled by Split-C+threads, for a uniprocessor and for a multiprocessor, to the size of the same function compiled by GCC. The multithreaded versions are several times larger than the sequential code as compiled by GCC, with the multiprocessor version the larger of the two. The largest effect on code size comes from the queueing method chosen: the lazy queue has extra code to construct the explicit queue, and the implicit queue has


[Figure: code-size expansion factor over GCC, for LT compiled for a uniprocessor and for a multiprocessor, across all storage, representation, disconnection, and queueing models; the factor ranges up to about 5.]

Figure: The amount of code dilation for each point in the design space, for a forkset with 2 pcalls.

extra code to transfer control to the parent. In effect, we have traded off time for space as compared to the library implementation of eager multithreading.

Id Comparison

Our earlier work on lazy threading was motivated by previous work on efficiently implementing Id using a threaded abstract machine (TAM). The compilation process involved compiling Id to TL0, an assembly language for TAM, and then compiling TL0 to C. In TL0, every function call is implemented as a parallel fork, with the associated overhead of scheduling, synchronization, frame allocation, and use of memory to transfer arguments and results. To improve the performance of Id executables, we developed a new assembly language that supports lazy threading. We developed a lazy-thread-based compiler for it that used stacklets, thread seeds, lazy-disconnect, and explicit queueing; the new language also supports strands. This compiler compiled the code to C using techniques that became the basis for the Split-C+threads compiler. In the table below we show the results of this early work. We can see that even with the overhead of compiling to C, the need to support strands, and no control over register usage, we achieved a significant speedup over TL0.


Program     Short Description
Gamteb      Monte Carlo neutron transport
Paraffins   Enumerate isomers of paraffins
Simple      Hydrodynamics and heat conduction
MMT         Matrix multiply test

Table: Dynamic runtime in seconds on a SparcStation for the Id benchmark programs under the TAM model and under lazy threads with multiple strands (stacklets, thread seeds, lazy-disconnect, and explicit queue).

[Figure: the design-space grid of storage model (linked frames, spaghetti stacks, stacklets), queue mechanism (lazy, implicit, explicit), thread representation (seeds, continuations), and disconnection method (eager, lazy), with rejected points annotated: linked frames are slower than the other memory models; eager disconnection causes excessive copying with stacklets; explicit queueing adds extra overhead to spaghetti stacks; eager-disconnect requires frame migration or message forwarding and limits the use of local pointers on distributed-memory machines; continuations require message forwarding on distributed-memory machines; implicit queueing distributes work at too fine a grain for distributed-memory machines; the thread representation axis itself has no drawbacks.]

Figure: Which points in the design space are effective, and why. The white squares are the effective models.

Summary

In each of the preceding sections we examined one axis of the design space and analyzed the performance of the points on that axis with respect to the other axes. The figure above summarizes our conclusions. Each shaded region indicates a point in the space that we have rejected, either because the intersection of models leads to poor performance or because a particular point was poor compared to the other possible points on that axis. For example, eager-disconnect and stacklets clearly work against each other, and any implementation that uses both of them will perform poorly.

The figure also includes three additional reasons for rejecting particular models for use on distributed-memory machines. Eager-disconnect requires either frame migration or message forwarding on distributed-memory machines; it also excludes the use of pointers to automatic variables. Continuation stealing is excluded since it requires frame migration. Finally, implicit queueing distributes work at too fine a grain.

All three of the remaining models perform well, even on very fine-grained programs. Stacklets, thread seeds, lazy-disconnect, and lazy queueing work particularly well together, requiring almost no preparation overhead before a potentially parallel call is initiated and little work when parallelism is actually needed.
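This pay-only-for-what-you-use fast path can be caricatured in a few lines of C; the suspension flag and names below are hypothetical stand-ins for the mechanisms described above:

```c
/* Minimal sketch of the lazy-fork fast path: a potentially parallel
 * call starts as an ordinary sequential call; only when the child
 * reports suspension does the parent do the extra work of creating
 * real parallelism. All names here are illustrative only. */
static int suspended;                     /* set if the child suspends */

static int child(int n) { suspended = (n < 0); return n * 2; }

static int pcall(int n) {
    int r = child(n);                     /* plain sequential call */
    if (suspended) {
        /* slow path: disconnect, queue remaining work, and
         * synchronize -- paid only when parallelism is needed */
        r = 0;
    }
    return r;
}
```

A child that runs to completion pays only the cost of the sequential call plus the fall-through; the disconnection and queueing machinery is reached only on the suspension path.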

Chapter

Related Work

Attempts to accommodate logical parallelism include thread packages, compiler techniques, clever runtime representations, and direct hardware support for fine-grained parallel execution. These approaches have been used to implement many parallel languages, e.g., Mul-T, Id, CC++, Charm, Opus, Cilk, Olden, and Cid. The common goal is to reduce the overhead associated with managing the logical parallelism. While much of this work overlaps ours, none has combined all of the techniques described in this thesis. More importantly, none has started from the premise that all calls, parallel or sequential, can be initiated in exactly the same manner.

Our work grew out of previous efforts to implement the non-strict functional language Id for commodity parallel machines. Our earlier work developed a Threaded Abstract Machine (TAM), which serves as an intermediate compilation target. The two key differences between this work and TAM are that under TAM calls are always parallel and that, due to TAM's scheduling hierarchy, calling another function does not immediately transfer control.

Our lazy thread fork allows all calls to begin in the same way and creates only the required amount of concurrency. In the framework of previous work, it allows excess parallelism to degrade efficiently into a sequential call. Many other researchers have proposed schemes that deal lazily with excess parallelism. One of the simplest is load-based inlining, which uses the load characteristics of the parallel machine to decide, at the time a potentially parallel call is encountered, whether to execute it sequentially (inline it) or execute it in parallel. This has the advantage of dynamically increasing the granularity of the


program. However, these decisions are irrevocable, which can lead to serious load imbalances or deadlock.

Our approach builds on lazy task creation (LTC), which introduced the idea of lazy threads. LTC is based on linked frames, continuation stealing, eager-disconnect, and an explicit queue. LTC also uses work stealing to perform dynamic load balancing. These ideas were studied for Mul-T running on shared-memory machines. Since work stealing causes the parent to migrate to a new thread, LTC depends on the ability of the system to migrate activation frames. This either requires shared-memory hardware capabilities or restricts the kinds of pointers allowed in the language. Finally, LTC also depends on a garbage collector, which hides many of the costs of stack management.

Another proposed technique for improving LTC is leapfrogging, which uses multiple stacks, a limited form of continuation stealing, eager-disconnect, and explicit queues. Unlike the techniques we use, it restricts the use of continuation stealing in an attempt to reduce the cost of futures. Leapfrogging has been implemented using Cthreads, a lightweight thread package.

StackThreads uses both a stack and the heap for storing activation frames, in an attempt to reduce overhead for fine-grained programs running on a single processor. Activation frames are initially placed on the stack, and if a thread blocks, its activation frame is moved onto the heap, in effect implementing eager-disconnect on a hybrid memory model. Since the sequential call invariants are not enforced, StackThreads does not take advantage of passing control and data at the same time, reducing register usage and increasing synchronization overhead. They take a point of view diametrically opposed to ours in that all calls, sequential or parallel, use the same representation. This triples the direct function call/return overhead and prevents the use of registers.

A much simpler thread model is advocated in Shared Filaments and Distributed Filaments. A filament is an extremely lightweight thread that does not have a stack associated with it. This works well when a thread does not fork other threads. More general threads are supported with a single stack, because language restrictions make it impossible for a parent to be scheduled while any of its children are waiting for a synchronization event. Distributed Filaments are combined with a distributed shared-memory system. In the case of a remote page fault, communication can be overlapped with computation because each processor runs multiple scheduling threads.

Olden is based on spaghetti stacks, continuation stealing, eager-disconnect, and an explicit queue. In Olden, as in previous uses of spaghetti stacks, a garbage collector is used to reclaim freed frames. Olden's thread model is more powerful than ours, since in Olden threads can migrate. The idea is that a thread computation that follows references to unstructured heap-allocated data might increase locality if migration occurs. On the other hand, this model requires migrating the current call frame, as well as disallowing local pointers to other frames on the stack.

We use stacklets for efficient stack-based frame allocation in parallel programs. Previous work by Hieb et al. describes similar ideas for handling continuations efficiently, but it uses a garbage collector.

Our implementation technique for encoding the suspend and steal entry points for a thread seed or continuation builds on the use of multiple offsets from a single return address to handle special cases. This technique was used in SOAR. It was also applied to Self, which uses parent-controlled return continuations to handle debugging.

Finally, many lightweight thread packages have been developed. Cthreads is a runtime library that provides multiple threads of control and synchronization primitives for parallel programming at the level of the C language. Scheduler activations reduce the overhead by moving fine-grained threads completely to the user level and relying on the kernel only in infrequent cases. Synthesis is an operating-system kernel for a parallel and distributed computational environment that integrates dynamic load balancing and dynamic compilation techniques. Chant is a lightweight threads package used in the implementation of an HPF extension called Opus. Chant provides an interface for lightweight user-level threads that have the capability of communication and synchronization across separate address spaces. While user-level thread packages eliminate much of the overhead encountered in traditional operating-system thread packages, they are still not as lightweight as many of the systems mentioned above that use special runtime representations supported by the compiler. Since the primitives of thread packages are exposed at the library level, the compiler optimizations presented in this thesis are not possible for such systems.

Chapter

Conclusions

Many modern parallel languages support dynamic creation of threads or require multithreading in their implementations. In these languages it is desirable for the logical parallelism expressed by the threads to exceed the physical parallelism of the machine. In practice, most logical threads need not be independent threads. Instead, they can run as sequential calls, which are inherently cheaper than independent threads. The challenge to the language implementor is that one cannot generally predict which logical threads can be implemented as sequential calls. In lazy multithreading systems, each logical thread begins execution sequentially, with the attendant efficient stack management and direct transfer of control and data. Only if a thread must execute in parallel does it get its own thread of control.

This dissertation presents a design space for implementing multithreaded systems without preemption. We introduce a sufficient set of primitives so that any point in the design space may be implemented using our new primitives. The primitives and implementation techniques that we propose reduce the cost of a potentially parallel call to nearly that of a sequential call, without restricting the parallelism inherent in the program. Using our primitives, we implement a system (in some cases a previously proposed system) at each point in the design space.

We have shown that, using appropriate compiler techniques, we can provide a fast parallel call with nearly the full efficiency of a sequential call when the child thread executes locally and runs to completion without suspension. Such child threads occur frequently both with aggressively parallel languages, such as Id, and with more conservative languages, such as C with parallel calls (e.g., Split-C+threads).


The central idea behind a fast parallel call is to pay only for what is used. Thus a local fork is performed essentially as a sequential call, with the attendant efficient stack management and direct transfer of control and data. If the child actually suspends before completion, control is returned to the parent so that it can take appropriate action. Similarly, remote work is generated lazily.

Our compilation techniques, and the new points in the lazy thread design space we introduce, exploit the one bit of flexibility in the sequential call: the indirect jump on return. First, we exploit this flexibility by associating multiple return entry points with each return address. The extra entry points are linked to compiler-generated code streams that handle any necessary parallelism without advance preparation. Second, we exploit the indirect jump by using parent-controlled return continuations to eliminate the need for synchronization until the child and parent need to run in parallel. The new design points introduced (stacklets on the memory axis, thread seeds on the thread-representation axis, lazy-disconnect on the disconnection axis, and implicit and lazy queueing) also arise from exploiting the flexibility of the indirect jump on return.

We find that performance is best in implementations that strike a balance between preparation before the potentially parallel call and extra work when parallelism is actually needed. Our implementation of spaghetti stacks illustrates this point perfectly. If a lazy or implicit queue is used, then when a thread runs to completion no unnecessary check is made to see if reclamation is necessary. However, if a thread suspends, we use pcrcs to ensure that the check is performed when the thread eventually completes. This strikes a proper balance by avoiding run-time preparation and limiting the work necessary when parallelism is needed. However, when spaghetti stacks are used in conjunction with an explicit queue, we prepare for the worst case by performing the check on every return, making this an unattractive combination. Stacklets also achieve balance for storing thread state, while linked frames require too much preparation. Continuations and thread seeds, when implemented using multiple return entry points and pcrcs, both achieve this balance. Eager-disconnect, when combined with stacklets or spaghetti stacks and explicit queueing, requires too much work when parallelism is needed. Finally, of the three queueing methods, lazy queueing works well, while explicit queueing requires preparation on every parallel call and implicit queueing interacts badly with work stealing when parallelism is actually needed.
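A toy allocator makes the spaghetti-stack discipline concrete. This is a sketch under stated assumptions: the names, the fixed region size, and the absence of overflow handling are all simplifications, not the dissertation's implementation.

```c
#include <stddef.h>

/* A spaghetti stack grows by bumping a top pointer and has no general
 * free operation: frames are reclaimed only from the top. */

#define REGION_BYTES 4096
static unsigned char region[REGION_BYTES];
static size_t top;                      /* next free location */

static void *spaghetti_alloc(size_t n) {
    void *frame = &region[top];
    top += n;                           /* no per-frame bookkeeping */
    return frame;
}

/* Fast path: a child that ran to completion simply lowers the top
 * pointer. Only a suspended frame forces a reclamation check later,
 * which is the balance described above when pcrcs are used. */
static void spaghetti_free_top(size_t n) {
    top -= n;
}
```

With a lazy or implicit queue, the completion path is exactly this pointer decrement; the reclamation check is reached only through the retargeted return continuation of a frame that actually suspended.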

Our empirical studies using Split-C threads on the Berkeley NOW show that the combination of stacklets or spaghetti stacks, thread seeds, lazy-disconnect, and lazy queueing offers very good parallel performance and supports fine-grained parallelism even on a distributed-memory machine. In contrast, the linked-frame storage model, which is the most commonly used storage model for compiler-integrated multithreading, performs poorly. When concurrency is needed on a single processor, we find that continuations are also a reasonable choice for representing threads. The tradeoff between thread seeds and continuations on shared-memory machines remains unexplored; on such machines migration across address spaces is not necessary, removing the obstacle to using continuations.

This dissertation opens several areas for exploration. The most obvious is comparing the design space on shared-memory machines. Three other interesting areas are generalizing the use of thread seeds, the use of multiple code streams to support copy-on-suspend in parallel programs, and the efficient placement of network polling operations.

One way to view thread seeds is as an efficient way to handle exceptions without advance preparation. Broadly speaking, an exception is an infrequently encountered condition that must nonetheless be handled as part of normal program execution. In our case, the exception is the need to run a child in parallel. Perhaps thread seeds can be applied to a broader class of exceptions in languages that promote the use of explicit exception handling.

Our results depend heavily on generating multiple code streams for each forkset. We also see how multiple code streams enable the copy-on-suspend optimization, which eliminates unnecessary copying of shared data structures. Perhaps copy-on-suspend can be implemented automatically, and if so, multiple code streams may allow other optimizations.

Because our compilation strategy eliminates most of the overhead due to multithreading, the dominating overhead on a distributed-memory multiprocessor is the network poll. We currently take a very conservative approach and insert many more polling operations than necessary. An important optimization to investigate is how to reduce the number of polling operations while still guaranteeing correct network behavior.
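The conservative placement described above amounts to polling at every function entry and every loop back edge. The sketch below makes that cost visible; net_poll and the counter are stand-in names assumed for illustration, not the actual runtime's poll.

```c
/* Conservative polling placement: one poll per function entry plus one
 * per loop iteration. The counter stands in for the real network poll. */

static int polls;

static void net_poll(void) {
    polls++;        /* a real poll would drain pending network messages */
}

static int sum(const int *a, int n) {
    net_poll();                 /* poll at function entry */
    int s = 0;
    for (int i = 0; i < n; i++) {
        net_poll();             /* poll on each loop back edge */
        s += a[i];
    }
    return s;
}
```

Even this tiny loop performs n + 1 polls; eliminating the redundant ones without allowing unbounded poll-free intervals is exactly the optimization question raised above.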

In conclusion, we have shown that fine-grained parallelism can be efficiently supported, thus allowing programmers and back-end compilers the freedom to specify all the parallelism in a program without concern about excessive overhead.

Appendix A

Glossary

cactus stack
    The global tree of thread frames.

child
    The state of a frame that has been invoked with an lfork but has not been disconnected.

child thread
    The thread created by one of the fork operations (fork, dfork, or lfork). The thread executing the fork operation is the parent thread.

closure
    A data structure that represents a nascent thread, containing the arguments and the address of the thread's codeblock. The closure is self-contained and can be turned into an independent thread by the general scheduler.

codeblock
    The set of instructions that comprise the body of a function or a thread.

continuation stealing
    An operation that allows a parent thread to disconnect from its lazy child thread. It disconnects the parent by restarting the parent at its current continuation.

disconnection
    An operation that elevates a lazy thread into an independent thread. This may involve disconnecting the control link between the lazy thread and its parent, the storage used by the lazy thread and its parent, or both.

daemon thread
    A thread that remains alive until the end of the program.


dexit
    The instruction in MAM-DF that terminates a thread that was invoked by dfork.

dfork
    An instruction in MAM-DF that creates and directly schedules a child thread.

downward thread
    A thread that terminates before its parent terminates.

eager-disconnect
    The act of disconnecting a child from its parent by changing the parent frame so that future lazy threads may be invoked in the same manner as before disconnection. Compare to lazy-disconnect.

eager fork
    A fork that always creates an independent thread.

enable
    The instruction that ensures that a thread is on the ready queue and in the ready state.

exit
    The instruction that terminates a thread and idles the processor on which it was running.

fork
    The instruction that creates a new independent thread.

forkset
    A set of forks that are related by a common join.

frame area
    The area of a stack or stacklet that holds the activation frames for one independent thread: the base frame, its lazy thread descendants, and their sequential call frames.

has-child
    The state a parent thread enters when it lforks a lazy child thread.

idle thread
    A thread waiting on an event. It is always an independent thread.

independent thread
    A thread that is not lazy, i.e., one with its own logical stack.

indirect inlet address
    An address of a location in a thread's frame which contains the address of an inlet. The address is used by lreturn to find the inlet to run when the child terminates.


inlet
    A code fragment that processes data received from another thread. All inter-thread communication is handled by transferring data with send to an inlet.

ireturn
    The instruction used to return control from an inlet back to the sending thread.

is-child
    The state a lazy thread enters when it is created.

join counter
    A counter in the parent thread used in the synthesis of the join operation for threads.

lazy thread
    A thread that has been started by an lfork and is using its parent's stack.

lfork
    An instruction in MAM-DS and LMAM that creates and directly schedules a child thread.

lazy-disconnect
    The act of disconnecting a child from its parent without changing the parent's frame. It causes future lazy threads to be invoked differently than before disconnection. Compare to eager-disconnect.

linked frames
    A thread storage model in which every activation frame is fixed-sized and heap-allocated.

LMAM
    A Lazy Multithreaded Abstract Machine. A lazy thread is directly scheduled and can share the same stack as its parent thread.

lreturn
    An instruction in MAM-DS and LMAM that simultaneously returns control and data from a child thread to its parent.

nascent thread
    A thread that has not yet started, but will start the next time its parent is resumed.

MAM
    The basic multithreaded abstract machine, which supports only eager thread creation.

MAM-DF
    A multithreaded abstract machine with direct fork.


MAM-DS
    A multithreaded abstract machine with direct scheduling for both thread creation and termination: lfork directly calls its lazy child, and lreturn directly schedules its parent.

multiple stacks
    A thread storage model in which each thread is assigned its own stack; lforks and calls are allocated on the same stacks as their parents.

parallel overhead
    The overhead introduced by a parallel algorithm over its sequential algorithm.

parent queue
    The set of parent threads that have given control of their processors to their children. These threads are all in the has-child state.

parent thread
    The thread that executes a fork operation (fork, lfork, or dfork). The thread created by the fork is the child thread.

ready thread
    A thread in the ready state. It is always an independent thread.

recv
    The first instruction of an inlet. It describes the data expected by the inlet.

ready queue
    The set of independent threads that are in the ready state, i.e., that are ready to execute.

return register
    Contains the return continuation used by a child to return results to its parent. In MAM the return continuation is an inlet pointer and a thread pointer; it is redefined in MAM-DS to be the indirect inlet address.

running thread
    A thread in the running state. It is always an independent thread.

send
    The instruction used to transfer data from one thread to the inlet of another thread.

spaghetti stack
    A memory region that is unbounded in one direction but, unlike a stack, has no free operation. A spaghetti stack requires the use of a top pointer, which points to the next free location in the stack.

stacklet
    A fixed-sized, heap-allocated chunk of memory that behaves like a stack.


strands
    A flow of control within a function that may not suspend. Strands allow parallelism within a function.

stub
    A frame at the bottom of a stack or stacklet that is used to maintain the links between independent threads and the cactus stack.

suspend
    The instruction that causes a thread to give up its processor and enter the idle state. If the thread was lazy, this may also involve disconnecting the thread from its parent.

thread seed
    A representation for a nascent thread. It is a pointer to a context and a set of related code pointers, where each of the code pointers can be derived at link time from a single code pointer.

thread
    A locus of control which can perform calls to arbitrary nesting depth, suspend at any point, and fork additional threads. Threads are scheduled independently and are non-preemptive. Threads can be either independent or lazy.

upward thread
    A thread that terminates after its parent terminates.

yield
    The instruction that causes the thread executing it to be placed on the ready queue.

work stealing
    The act of assigning a thread that comes from the parent queue to an idle processor. Work stealing is accomplished through either continuation stealing or seed activation.

Bibliography

T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, February.

Andrew W. Appel. Garbage collection can be faster than stack allocation. Information Processing Letters, January.

Arvind and D. E. Culler. Dataflow architectures. In Annual Reviews in Computer Science. Annual Reviews Inc., Palo Alto, CA. Reprinted in Dataflow and Reduction Architectures, S. S. Thakkar, editor, IEEE Computer Society Press.

Arvind, D. E. Culler, R. A. Iannucci, V. Kathail, K. Pingali, and R. E. Thomas. The Tagged Token Dataflow Architecture. Technical Report FLA memo, MIT Lab for Comp. Sci., Cambridge, MA, August.

Arvind, S. K. Heller, and R. S. Nikhil. Programming Generality and Parallel Computers. In Proc. of the Fourth Int. Symp. on Biological and Artificial Intelligence Systems, Trento, Italy, September. ESCOM, Leiden.

Arvind, R. S. Nikhil, and K. K. Pingali. I-Structures: Data Structures for Parallel Computing. ACM Transactions on Programming Languages and Systems, October.

R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, P. Lisiecki, K. H. Randall, A. Shaw, and Y. Zhou. Cilk reference manual. MIT Lab for Comp. Sci., Cambridge, MA, September.


R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: an efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, August.

R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the Thirty-Fifth Annual Symposium on Foundations of Computer Science (FOCS), Santa Fe, NM, November.

Daniel G. Bobrow and Ben Wegbreit. A model and stack implementation of multiple environments. Communications of the ACM.

N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. Myrinet: a gigabit-per-second local-area network. IEEE Micro, February.

M. C. Carlisle, A. Rogers, J. H. Reppy, and L. J. Hendren. Early experiences with Olden (parallel programming). In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

K. M. Chandy and C. Kesselman. Compositional C++: compositional parallel programming. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

E. C. Cooper and R. P. Draves. C Threads. Technical Report CMU-CS, Carnegie Mellon University, February.

D. Culler, A. Dusseau, S. Goldstein, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proceedings of Supercomputing, November.

D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von Eicken. TAM: a compiler-controlled threaded abstract machine. Journal of Parallel and Distributed Computing, July.

D. E. Culler, K. E. Schauser, and T. von Eicken. Two Fundamental Limits on Dataflow Multiprocessing. In Proceedings of the IFIP WG Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, Orlando, FL. North-Holland, January.


David E. Culler, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Brent Chun, Steven Lumetta, Alan Mainwaring, Richard Martin, Chad Yoshikawa, and Frederick Wong. Parallel computing on the Berkeley NOW. To appear in Ninth Joint Symposium on Parallel Processing, Kobe, Japan.

D. R. Engler, D. K. Lowenthal, and G. R. Andrews. Shared Filaments: efficient fine-grain parallelism on shared-memory multiprocessors. Technical Report, University of Arizona, April.

J. E. Faust and H. M. Levy. The performance of an object-oriented threads package. SIGPLAN Notices, October.

Message Passing Interface Forum. MPI: a message-passing interface standard. International Journal of Supercomputer Applications and High Performance Computing, Fall/Winter.

I. Foster, C. Kesselman, and S. Tuecke. The Nexus approach to integrating multithreading and communication. Journal of Parallel and Distributed Computing, August.

V. W. Freeh, D. K. Lowenthal, and G. R. Andrews. Distributed Filaments: efficient fine-grain parallelism on a cluster of workstations. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Assoc.

S. C. Goldstein, K. E. Schauser, and D. E. Culler. Lazy Threads: Implementing a fast parallel call. Journal of Parallel and Distributed Computing, August.

Seth Copen Goldstein. The implementation of a threaded abstract machine. Master's thesis, Computer Science Division, University of California, Berkeley, CA, May. Technical Report UCB.

James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Sun Microsystems, August.

D. Grunwald, B. Calder, S. Vajracharya, and H. Srinivasan. Heaps o' Stacks: combined heap-based activation allocation for parallel programs. URL http://www.cs.colorado.edu/~grunwald.


M. Haines, D. Cronk, and P. Mehrotra. On the design of Chant: a talking threads package. In Proceedings Supercomputing. IEEE Computer Society Press.

E. A. Hauck and B. A. Dent. Burroughs' B6500/B7500 stack mechanism. In Proceedings of the AFIPS Spring Joint Computer Conference.

R. Hieb, R. Kent Dybvig, and C. Bruggeman. Representing control in the presence of first-class continuations. SIGPLAN Notices, June.

Guido Hogen and Rita Loogen. A New Stack Technique for the Management of Runtime Structures in Distributed Implementations. Aachener Informatik-Berichte, RWTH Aachen, Aachen, Germany.

U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. SIGPLAN Notices, July.

IEEE. IEEE Standard for Threads Interface to POSIX. IEEE draft standard edition.

H. F. Jordan. Performance measurement on HEP: a pipelined MIMD computer. In Proc. of the Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, June.

L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object-oriented system based on C++. SIGPLAN Notices, October.

V. Karamcheti and A. Chien. Concert: efficient runtime support for concurrent object-oriented programming languages on stock hardware. In Proceedings Supercomputing. IEEE Computer Society Press, November.

David Keppel. Tools and techniques for building fast portable threads packages. Technical Report UW-CSE, Department of Computer Science and Engineering, University of Washington, May.

C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel. The High Performance Fortran Handbook. The MIT Press.

D. A. Kranz, R. H. Halstead, Jr., and E. Mohr. Mul-T: a high-performance parallel Lisp. SIGPLAN Notices, July.


R. P. Martin, A. M. Vahdat, D. E. Culler, and T. E. Anderson. Effects of communication latency, overhead, and bandwidth in a cluster architecture. In Proc. of the Intl. Symposium on Computer Architecture, June.

H. Massalin and C. Pu. Threads and input/output in the Synthesis kernel. In Twelfth ACM Symposium on Operating Systems Principles, December.

Dylan McNamee. Newthreads. http://www.cs.washington.edu/research/compiler/papers.d/nt.html.

P. Mehrotra and M. Haines. An overview of the Opus language and runtime system. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy task creation: a technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, July.

R. S. Nikhil. Id version reference manual. Technical Report CSG Memo, MIT Lab for Comp. Sci., March.

R. S. Nikhil. Id Version Reference Manual. Technical Report CSG Memo (to appear), MIT Lab for Comp. Sci., Cambridge, MA.

R. S. Nikhil. A Multithreaded Implementation of Id using P-RISC Graphs. In Proc. Sixth Ann. Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, August.

R. S. Nikhil. Cid: a parallel, shared-memory C for distributed-memory machines. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

R. S. Nikhil. A multithreaded implementation of Id using P-RISC graphs. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-Machine multicomputer: an architectural evaluation. Computer Architecture News, May.


G. M. Papadopoulos and D. E. Culler. Monsoon: an Explicit Token-Store Architecture. In Proc. of the Annual Int. Symp. on Comp. Arch., Seattle, Washington, May.

A. Rogers, M. C. Carlisle, J. H. Reppy, and L. J. Hendren. Supporting dynamic data structures on distributed-memory machines. ACM Transactions on Programming Languages and Systems, March.

A. Rogers, J. Reppy, and L. Hendren. Supporting SPMD execution for dynamic data structures. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrman, C. Kaiser, S. Langlois, P. Leonard, and W. Neuhauser. Overview of the CHORUS distributed operating system. In Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures. USENIX Assoc.

Anurag Sah. Parallel language support on shared memory multiprocessors. Master's thesis, University of California, Berkeley, May.

Sun Microsystems, Inc. Solaris Threads.

K. Taura, S. Matsuoka, and A. Yonezawa. StackThreads: an abstract machine for scheduling fine-grain threads on stock CPUs. In Theory and Practice of Parallel Programming, International Workshop TPPP Proceedings. Springer-Verlag.

Thinking Machines Corporation, Cambridge, Massachusetts. The Connection Machine CM technical summary, January.

K. R. Traub. Implementation of Non-strict Functional Programming Languages. MIT Press.

D. M. Ungar. The design and evaluation of a high performance Smalltalk system. ACM Distinguished Dissertations. MIT Press.

M. T. Vandevoorde and E. S. Roberts. WorkCrews: an abstraction for controlling parallelism. International Journal of Parallel Programming, August.


T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: a mechanism for integrated communication and computation. In Proc. of the Intl. Symposium on Computer Architecture, Gold Coast, Australia, May.

D. B. Wagner and B. G. Calder. Leapfrogging: a portable technique for implementing efficient futures. SIGPLAN Notices, July.

Tim A. Wagner, Vance Maverick, Susan L. Graham, and Michael A. Harrison. Accurate static estimators for program optimization. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Orlando, Florida.

A. Yonezawa. ABCL: an object-oriented concurrent system. MIT Press Series in Computer Systems. MIT Press.