Multicore Programming Models and Their Implementation Challenges


Slide 1: Multicore Programming Models and their Implementation Challenges
Vivek Sarkar, Rice University ([email protected])
[Figure: dual-core chip floorplan showing per-core units (FPU, ISU, FXU, IDU, IFU, LSU, BXU), L2 cache banks, and the L3 directory/control]

Slide 2: Inverted Pyramid of Parallel Programming Skills
• Mainstream Parallelism-Oblivious ("Joe") Developers
• Parallelism-Aware ("Stephanie") Developers
• Concurrency Experts ("Doug")
Focus of this talk: bridging the gap between mainstream developers and concurrency experts.

Slide 3: Outline
• A sample of modern multicore programming models
• X10 + Habanero execution model and implementation challenges
• Rice Habanero Multicore Software research project

Slide 4: Comparison of Multicore Programming Models along Selected Dimensions
• Cilk: dynamic parallelism: spawn, sync | locality control: none | mutual exclusion: locks | collective & point-to-point synchronization: none | data parallelism: none
• Java Concurrency: dynamic parallelism: executors, task queues | locality control: none | mutual exclusion: locks, monitors, atomic classes | collective & point-to-point synchronization: synchronizers | data parallelism: concurrent collections
• Intel TBB: dynamic parallelism: generic algorithms, tasks | locality control: none | mutual exclusion: locks, atomic classes | collective & point-to-point synchronization: none | data parallelism: concurrent containers
• .NET Parallel Extensions: dynamic parallelism: generic algorithms, tasks | locality control: none | mutual exclusion: locks, monitors | collective & point-to-point synchronization: futures | data parallelism: PLINQ
• OpenMP: dynamic parallelism: SPMD (v2.5), tasks (v3.0) | locality control: none | mutual exclusion: locks, critical, atomic | collective & point-to-point synchronization: barriers | data parallelism: none
• CUDA: dynamic parallelism: none | locality control: device, grid, block, threads | mutual exclusion: none | collective & point-to-point synchronization: barriers | data parallelism: SIMD
• Intel Concurrent Collections: dynamic parallelism: tagged prescription of steps | locality control: none | mutual exclusion: none | collective & point-to-point synchronization: tagged put & get operations on item collections | data parallelism: none
• X10 + Habanero (builds on Java Concurrency): dynamic parallelism: async, finish, delayed async | locality control: places | mutual exclusion: atomic blocks, Java atomic classes | collective & point-to-point synchronization: future, force, clocks, phasers | data parallelism: SIMD/MIMD array extensions, Java concurrent collections

Slide 5: Dynamic Parallelism in Cilk

    cilk int fib (int n) {
        if (n < 2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x + y);
        }
    }

• cilk identifies a function as a Cilk procedure, capable of being spawned in parallel.
• spawn: the named child Cilk procedure can execute in parallel with the parent caller.
• sync: control cannot pass this point until all spawned children have returned.

Slide 6: Cilk Language
• Cilk is a faithful extension of C: if the Cilk keywords are elided, C program semantics result (except for inlets).
• Limitations on software reuse and composability:
  - the spawn keyword can only be applied to a cilk function
  - the spawn keyword cannot be used in a C function
  - a cilk function cannot be called with normal C calling conventions; it must be called with a spawn and waited for by a sync
• Fully strict: a parent must wait for its children to terminate.

Slide 7: Callbacks with Cilk Inlets

Slide 8: Summary of Cilk Constructs
• spawn: create a child task
• sync: await termination of child tasks
• SYNCHED: check whether child tasks have terminated
• inlet: callback computation upon termination of a child task
• Advanced features (for "Doug"): Cilk_lockvar, Cilk_fence(), abort

Slide 9: java.util.concurrent
• General-purpose toolkit for developing concurrent applications. No more "reinventing the wheel"!
• Goals: "Something for everyone!"
  - Make some problems trivial to solve by everyone ("Joe"): develop thread-safe classes, such as servlets, built on concurrent building blocks like ConcurrentHashMap.
  - Make some problems easier to solve by concurrent programmers ("Stephanie"): develop concurrent applications using thread pools, barriers, latches, and blocking queues.
  - Make some problems possible to solve by concurrency experts ("Doug"): develop custom locking classes and lock-free algorithms.
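As a concrete illustration of the "Stephanie"-level usage above, here is a minimal sketch (not from the original slides; the class name, pool size, and workload are illustrative) that submits Callable tasks to a fixed thread pool and collects their results through Futures:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class SumOfSquares {
        public static void main(String[] args) throws Exception {
            // Fixed-size pool: tasks queue up when all four workers are busy.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<Integer>> results = new ArrayList<Future<Integer>>();
            for (int i = 1; i <= 10; i++) {
                final int n = i;
                // submit() returns a Future that will hold the Callable's result.
                results.add(pool.submit(new Callable<Integer>() {
                    public Integer call() { return n * n; }
                }));
            }
            int sum = 0;
            for (Future<Integer> f : results) {
                sum += f.get();           // blocks until that task completes
            }
            System.out.println("sum of squares = " + sum);   // prints 385
            pool.shutdown();              // stop accepting tasks; workers exit when idle
        }
    }

The point of the Executor framework is exactly this division of labor: the developer reasons about tasks and results, while thread creation, reuse, and scheduling stay inside the pool.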
Slide 10: Overview of j.u.c
• Executors: Executor, ExecutorService, ScheduledExecutorService, Callable, Future, ScheduledFuture, Delayed, CompletionService, ThreadPoolExecutor, ScheduledThreadPoolExecutor, AbstractExecutorService, Executors, FutureTask, ExecutorCompletionService
• Queues: BlockingQueue, ConcurrentLinkedQueue, LinkedBlockingQueue, ArrayBlockingQueue, SynchronousQueue, PriorityBlockingQueue, DelayQueue
• Concurrent collections: ConcurrentMap, ConcurrentHashMap, CopyOnWriteArray{List,Set}
• Synchronizers: CountDownLatch, Semaphore, Exchanger, CyclicBarrier
• Locks (java.util.concurrent.locks): Lock, Condition, ReadWriteLock, AbstractQueuedSynchronizer, LockSupport, ReentrantLock, ReentrantReadWriteLock
• Atomics (java.util.concurrent.atomic): Atomic[Type], Atomic[Type]Array, Atomic[Type]FieldUpdater, Atomic{Markable,Stamped}Reference
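To make one of the synchronizers listed above concrete, here is a minimal sketch (again illustrative, not from the slides) in which a CountDownLatch lets the main thread wait for N worker threads to finish:

    import java.util.concurrent.CountDownLatch;

    public class LatchDemo {
        public static void main(String[] args) throws InterruptedException {
            final int N = 4;
            final CountDownLatch done = new CountDownLatch(N);
            for (int i = 0; i < N; i++) {
                final int id = i;
                new Thread(new Runnable() {
                    public void run() {
                        System.out.println("worker " + id + " finished");
                        done.countDown();   // record one completion
                    }
                }).start();
            }
            done.await();                   // block until the count reaches zero
            System.out.println("all workers done");
        }
    }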
Slide 11: Intel® Threading Building Blocks Components
• Generic parallel algorithms ("Stephanie"): parallel_for, parallel_reduce, pipeline, parallel_sort, parallel_while, parallel_scan
• Concurrent containers ("Joe"): concurrent_hash_map, concurrent_queue, concurrent_vector
• Task scheduler ("Doug")
• Low-level synchronization primitives: atomic, mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex
• Memory allocation: cache_aligned_allocator, scalable_allocator
• Timing: tick_count

Slide 12: Serial Example

    static void SerialApplyFoo( float a[], size_t n ) {
        for( size_t i=0; i!=n; ++i )
            Foo(a[i]);
    }

Will parallelize by dividing the iteration space of i into chunks.

Slide 13: Parallel Version (TBB extensions shown in red on the original slide)

    class ApplyFoo {                               // the task
        float *const my_a;
    public:
        ApplyFoo( float *a ) : my_a(a) {}
        void operator()( const blocked_range<size_t>& range ) const {
            float *a = my_a;
            for( size_t i=range.begin(); i!=range.end(); ++i )
                Foo(a[i]);
        }
    };

    void ParallelApplyFoo( float a[], size_t n ) {
        parallel_for(                              // the pattern
            blocked_range<size_t>( 0, n ),         // the iteration space
            ApplyFoo(a),
            auto_partitioner() );                  // automatic grain size
    }

Slide 14: Microsoft Parallel LINQ (PLINQ)
• Declarative data parallelism via LINQ-to-Objects.
• PLINQ supports all LINQ operators: Select, Where, Join, GroupBy, Sum, etc.
• Activated with the AsParallel extension method:

    var q = from x in data              where p(x) orderby k(x) select f(x);
    var q = from x in data.AsParallel() where p(x) orderby k(x) select f(x);

• Works for any IEnumerable<T>.
• Query syntax enables the runtime to auto-parallelize:
  - an automatic way to generate more Tasks, like Parallel
  - graph analysis determines how to do it
  - classic data parallelism: partitioning + pipelining

Slide 15: PLINQ "Gotchas"
• Ordering is not guaranteed:

    int[] data = new int[] { 0, 1, 2 };
    var q = from x in data.AsParallel(QueryOptions.PreserveOrdering)
            select x * 2;
    int[] scaled = q.ToArray();   // == { 0, 2, 4 }?

• Exceptions are aggregated:

    object[] data = new object[] { "foo", null, null };
    var q = from x in data.AsParallel() select x.ToString();

  NullReferenceExceptions on data[1], data[2], or both? PLINQ will always aggregate them under an AggregateException.
• Side effects and mutability are serious issues. Most queries do not use side effects... but:

    var q = from x in data.AsParallel() select x.f++;

  This is OK if all elements in data are unique; otherwise it is a race condition.

Slide 16: Intel Concurrent Collections (CnC)
The application problem is handled in two stages:
• Domain expert (a person, "Joe"): decomposition into serial steps. Brings only domain knowledge, namely the decomposition constraints required by the application, and no tuning knowledge. The output is a Concurrent Collections program.
• Architecture tuning expert (a person, runtime, or compiler, "Stephanie"): maps that program to the target platform. Handles actual parallelism, locality, overhead, load balancing, distribution among processors, and scheduling within a processor. Brings no domain knowledge, only tuning knowledge.
Source: Kathleen Knobe, http://softwarecommunity.intel.com/articles/eng/3862.htm

Slide 17: Intel Concurrent Collections (contd)
[Figure: CnC dataflow graph for a cell-tracking application, with steps (Cell detector), (Cell tracker), (Arbitrator initial), (Correction filter), (Prediction filter), (Arbitrator final) connected through item collections [Input Image], [Histograms], [Cell candidates], [Labeled cells initial], [State measurements], [Motion corrections], [Predicted states], [Labeled cells final], [Final states]]
• What are the high-level operations? (Steps)
• What are the chunks of data? (Item collections)
• What are the producer/consumer relationships?
• What are the inputs and outputs?

Slide 18: Make It Precise Enough to Execute
[Figure: the same graph with every step and item collection indexed by tag components, e.g. (Cell detector: K), [Input Image: K], (Correction filter: K, cell), [Motion corrections: K, cell], [Predicted states: K, cell]]
• How do we distinguish among the instances?
Source: Kathleen Knobe

Slide 19: Make It Precise Enough to Execute (contd)
[Figure: the same graph extended with tag collections <ImageTags: K>, <Cell tags Initial: K, cell>, and <Cell tags Final: K, cell> prescribing the steps]
• How do we distinguish among the instances? Tags.
• What are the sets of instances? <Prescriptions>.
• Who determines these sets? [Steps].
Source: Kathleen Knobe

Slide 20: Domain Expert's View of Concurrent Collections
• No thinking about parallelism: only domain knowledge.
• No overwriting: collections obey single assignment.
• No arbitrary serialization: the only ordering constraints come from tagged puts and gets.
• Steps are functional: no side effects.
• The result is deterministic and race-free.
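CnC can guarantee determinism and race freedom because steps are functional. For contrast, the following minimal Java sketch (illustrative, not from the slides) shows the lost-update race that shared side effects permit, the same hazard flagged in the PLINQ "gotchas", along with the atomic fix from java.util.concurrent:

    import java.util.concurrent.atomic.AtomicInteger;

    public class CounterRace {
        static int unsafeCount = 0;                        // plain field: ++ is not atomic
        static final AtomicInteger safeCount = new AtomicInteger(0);

        public static void main(String[] args) throws InterruptedException {
            Thread[] workers = new Thread[4];
            for (int t = 0; t < workers.length; t++) {
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; i < 100000; i++) {
                            unsafeCount++;                 // racy read-modify-write
                            safeCount.incrementAndGet();   // atomic read-modify-write
                        }
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            // safeCount always ends at 400000; unsafeCount is usually smaller,
            // because concurrent increments of the plain field can be lost.
            System.out.println("unsafe = " + unsafeCount + ", safe = " + safeCount.get());
        }
    }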