An E Cient and General Implementation of Futures on Large

An Ecient and General Implementation of Futures on Large Scale SharedMemory Multipro cessors A Dissertation Presented to The Faculty of the Graduate School of Arts and Sciences Brandeis University Department of Computer Science James S Miller advisor In Partial Fulllment of the Requirements of the Degree of Doctor of Philosophy by Marc Feeley April This dissertation directed and approved by the candidates committee has b een ac cepted and approved by the Graduate Faculty of Brandeis University in partial fulll ment of the requirements for the degree of DOCTOR OF PHILOSOPHY Dean Graduate School of Arts and Sciences Dissertation Committee Dr James S Miller chair Digital Equipment Corp oration Prof Harry Mairson Prof Timothy Hickey Prof David Waltz Dr Rob ert H Halstead Jr Digital Equipment Corp oration Copyright by Marc Feeley Abstract An Ecient and General Implementation of Futures on Large Scale SharedMemory Multipro cessors A dissertation presented to the Faculty of the Graduate School of Arts and Sciences of Brandeis University Waltham Massachusetts by Marc Feeley This thesis describ es a highp erformance implementation technique for Multilisps future parallelism construct This metho d addresses the nonuniform memory access NUMA problem inherent in large scale sharedmemory multiprocessors The technique is based on lazy task creation LTC a dynamic task partitioning mechanism that dramatically reduces the cost of task creation and consequently makes it p ossible to exploit ne grain parallelism In LTC idle pro cessors get work to do by stealing tasks from other pro cessors A previously prop osed implementation of LTC is the sharedmemory SM proto col The main disadvantage of the SM proto col is that it requires the stack to b e cached sub optimally on cacheincoherent machines This thesis prop oses a new implementation technique for LTC that allows full caching of the stack the messagepassing MP proto col Idle pro cessors ask for work by sending work request messages to other pro cessors After receiving such a message a pro cessor checks its private stack and task queue and sends back a task if one is available The message passing proto col has the added b enets of a lower task creation cost and simpler algorithms Extensive exp eriments evaluate the p erformance of b oth proto cols on large sharedmemory multiprocessors a pro cessor GP and a pro cessor TC The results show that the MP proto col is consistently b etter than the SM proto col The dierence in p erformance is as high as a factor of two when a cache is available and a factor of when a cache is not available In addition the thesis shows that the semantics of the Multilisp language do es not have to b e imp overished to attain go o d p erformance The laziness of LTC can b e exploited to supp ort at virtually no cost several programming features including the KatzWeise continuation semantics with legitimacy dynamic scoping and fairness Acknowledgements Cette theseest dedieeames grandparents Rose et Emile Monna pour lamour que jai pour eux I wish to thank my family my friends and colleagues without whom this thesis would not have b een p ossible Sp ecial thanks go to Jim Miller my thesis advisor for giving me the freedom to explore my ideas at my own pace He has gone b eyond the call of duty to see me through with my degree Bert Halsteads words of encouragement gave me the condence that my ideas were interesting and worth writing ab out Thank you Bert Sabine Bergler deserves sp ecial thanks for taking care of me To Chris Mauricio Harry Emmanuel Don Shyam Larry Xiru Mary and Paulo thank you for making my stay at Brandeis so enjoyable Finally I wish to thank the National Science and Engineering Research Council of Canada and the Universitede Montrealfor nancial supp ort and Michigan State Uni versity Argonne National Lab oratory Lawrence Livermore National Lab oratory and the MIT AI Lab oratory for the use of their computers Contents Introduction Motivation Why Multilisp Fundamental Issues Architecture SharedMemory MIMD Computers NonUniform Memory Access Sharing Data Caches Memory Consistency The GP and TC Computers Memory Management Dynamic Partitioning Eager Task Creation Lazy Task Creation Overview Background Schemes Legacy FirstClass Continuations Continuation Passing Style Programming with Continuations Multilisps Mo del of Parallelism FUTURE and TOUCH i Placeholders Spawning Trees Types of Parallelism Pip eline Parallelism ForkJoin Parallelism Divide and Conquer Parallelism Implementing Eager Task Creation The Work Queue FUTURE and TOUCH Scheme Enco ding Chasing vs No Chasing Critical Sections Centralized vs Distributed Work Queue Fairness of Scheduling Dynamic Scoping Continuation Semantics Original Semantics MultiScheme Semantics KatzWeise Continuations KatzWeise Continuations with Legitimacy Implementing Legitimacy Sp eculation Barriers The Cost of Supp orting Legitimacy Benchmark Programs abisort allpairs fib mm mst poly qsort queens rantree ii scan sum tridiag The Performance of ETC Lazy Task Creation Overview of LTC Scheduling Task Stealing Behavior Task Susp ension Behavior Continuations for Futures Pro cedure Calling Convention Unlimited Extent Continuations Continuation Heapication Parsing Continuations Implementing FirstClass Continuations The LTC Mechanism The Lazy Task Queue Pushing and Popping Lazy Tasks Stealing Lazy Tasks The Dynamic Environment Queue The Problem of Overow The Heavyweight Task Queue Supp orting Weaker Continuation Semantics Synchronizing Access to the Task Stack The SharedMemory Proto col Avoiding Hardware Lo cks Cost of a Future on GP Impact of Memory Hierarchy on Performance The MessagePassing Proto col Really Lazy Task Creation Communicating Steal Requests Potential Problems with the MP Proto col Co de Generated for SM and MP Proto cols iii Summary Polling Eciently The Problem of Pro cedure Calls Co de Structure CallReturn Polling Short Lived Pro cedures Balanced Polling Subproblem Calls Reduction Calls Minimal Polling Handling Join Points Polling in Gambit Results .

An E Cient and General Implementation of Futures on Large

The Complexity of Flow Analysis in Higher-Order Languages

Fundamentals of Type Inference Systems

On the Resolution Semiring

Metatheorems About Convertibility in Typed Lambda Calculi

Chih-Hao Luke ONG Contents

ICFP 2008 Final Program

Database Query Languages Embedded in the Typed Lambda Calculus

First-Class Continuations Basic Research in Computer Science

Efficient Inference of Object Types

Notes on the History of Type Inference

Bottom-Up Beta-Reduction

Type Inference with Expansion Variables and Intersection Types in System E and an Exact Correspondence with Β-Reduction∗