
Multicore Programming Models and their Implementation Challenges

Vivek Sarkar, Rice University, [email protected]

[Figure: multicore chip layout showing per-core functional units (FPU, FXU, ISU, IDU, IFU, LSU, BXU), L2 caches, and L3 directory/control]

Inverted Pyramid of Parallel Programming Skills

[Figure: inverted pyramid of parallel programming skills]
• Mainstream Parallelism-Oblivious (“Joe”) Developers
• Parallelism-Aware (“Stephanie”) Developers
• Concurrency Experts (“Doug”)

Focus of this talk: bridging the gap between mainstream developers and concurrency experts

Outline

• A Sample of Modern Multicore Programming Models

• X10 + Habanero Execution Model and Implementation Challenges

• Rice Habanero Multicore Software research project

Comparison of Multicore Programming Models along Selected Dimensions

Model | Dynamic Parallelism | Locality Control | Mutual Exclusion | Collective & Point-to-point Synchronization | Data Parallelism
Cilk | spawn, sync | None | Locks | None | None
Java Concurrency | Executors, task queues | None | Locks, monitors, atomic classes | Synchronizers | Concurrent collections
Intel TBB | Generic algs, tasks | None | Locks, atomic classes | None | Concurrent containers
.Net Parallel Extensions | Generic algs, tasks | None | Locks, monitors | Futures | PLINQ
OpenMP | SPMD (v2.5), tasks (v3.0) | None | Locks, critical, atomic | Barriers | None
CUDA | None | Device, grid, block, threads | None | Barriers | SIMD
Intel Concurrent Collections | Tagged prescription of steps | None | None | Tagged put & get operations on Item Collections | None
X10 + Habanero | Async, finish, delayed async | Places | Atomic blocks, Java atomic operations/classes | Future, force, clocks, phasers, Java concurrent collections | SIMD/MIMD array extensions (builds on Java Concurrency)

Dynamic Parallelism in Cilk

cilk int fib (int n) {
  if (n<2) return (n);
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

• cilk identifies a function as a Cilk procedure, capable of being spawned in parallel.
• spawn: the named child Cilk procedure can execute in parallel with the parent caller.
• sync: control cannot pass this point until all spawned children have returned.

Cilk Language

• Cilk is a faithful extension of C
  — if Cilk keywords are elided → C program semantics
  — except for inlets …
• Limitations on software reuse and composability
  — spawn keyword can only be applied to a cilk function
  — spawn keyword cannot be used in a C function
  — cilk function cannot be called with normal C call conventions: must be called with a spawn & waited for by a sync
• Fully strict – parent must wait for child to terminate

Callbacks with Cilk Inlets

Summary of Cilk constructs

• spawn – create child task
• sync – await termination of child tasks
• SYNCHED – check if child tasks have terminated
• inlet – callback computation upon termination of child task
• Advanced features (for “Doug”)
  — Cilk_lockvar
  — Cilk_fence()
  — abort

java.util.concurrent

• General-purpose toolkit for developing concurrent applications: no more “reinventing the wheel”!
• Goals: “Something for Everyone!”
  — Make some problems trivial to solve by everyone (“Joe”): develop thread-safe classes, such as servlets, built on concurrent building blocks like ConcurrentHashMap
  — Make some problems easier to solve by concurrent programmers (“Stephanie”): develop concurrent applications using thread pools, barriers, latches, and blocking queues
  — Make some problems possible to solve by concurrency experts (“Doug”): develop custom locking classes and lock-free algorithms
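To make these levels concrete, here is a small, self-contained Java sketch of my own (the task and class names are illustrative, not from the slides): a “Stephanie”-level pattern that runs tasks on a thread pool, joins them through Futures, and publishes results into a ConcurrentHashMap.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class JucDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);   // thread pool
        ConcurrentMap<Integer, Long> squares = new ConcurrentHashMap<>();
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            final int k = i;
            // each task publishes its result into a thread-safe map
            futures.add(pool.submit(() -> squares.put(k, (long) k * k)));
        }
        for (Future<?> f : futures) f.get();   // wait for all tasks to finish
        System.out.println(squares);
        pool.shutdown();
    }
}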

Overview of j.u.c

• Executors: Executor, ExecutorService, ScheduledExecutorService, Callable, Future, ScheduledFuture, Delayed, CompletionService, ThreadPoolExecutor, ScheduledThreadPoolExecutor, AbstractExecutorService, Executors, FutureTask, ExecutorCompletionService
• Queues: BlockingQueue, ConcurrentLinkedQueue, LinkedBlockingQueue, ArrayBlockingQueue, SynchronousQueue, PriorityBlockingQueue, DelayQueue
• Concurrent Collections: ConcurrentMap, ConcurrentHashMap, CopyOnWriteArray{List,Set}
• Synchronizers: CountDownLatch, Semaphore, Exchanger, CyclicBarrier
• Locks (java.util.concurrent.locks): Lock, Condition, ReadWriteLock, AbstractQueuedSynchronizer, LockSupport, ReentrantLock, ReentrantReadWriteLock
• Atomics (java.util.concurrent.atomic): Atomic[Type], Atomic[Type]Array, Atomic[Type]FieldUpdater, Atomic{Markable,Stampable}Reference

Intel® Threading Building Blocks Components (“Joe”, “Stephanie”)

• Generic Parallel Algorithms: parallel_for, parallel_reduce, pipeline, parallel_sort, parallel_while, parallel_scan
• Concurrent Containers: concurrent_hash_map, concurrent_queue, concurrent_vector

Task scheduler (Doug)

• Low-Level Synchronization Primitives: atomic, mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex
• Memory Allocation: cache_aligned_allocator, scalable_allocator
• Timing: tick_count

Serial Example

static void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i!=n; ++i )
        Foo(a[i]);
}

Will parallelize by dividing the iteration space of i into chunks

Parallel Version (TBB extensions highlighted in the original slide)

// Task: the loop body packaged as a function object
class ApplyFoo {
    float *const my_a;
public:
    ApplyFoo( float *a ) : my_a(a) {}
    // Pattern: operator() applied to one chunk of the iteration space
    void operator()( const blocked_range<size_t>& range ) const {
        float *a = my_a;
        for( size_t i=range.begin(); i!=range.end(); ++i )
            Foo(a[i]);
    }
};

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for( blocked_range<size_t>( 0, n ),
                  ApplyFoo(a),
                  auto_partitioner() );   // automatic grain size
}

Parallel LINQ (PLINQ)

• Declarative data parallelism via LINQ-to-Objects
• PLINQ supports all LINQ operators
  — Select, Where, Join, GroupBy, Sum, etc.
• Activated with the AsParallel extension method:
    var q = from x in data where p(x) orderby k(x) select f(x);
    var q = from x in data.AsParallel() where p(x) orderby k(x) select f(x);
• Works for any IEnumerable<T>

• Query syntax enables the runtime to auto-parallelize
  — Automatic way to generate more Tasks, like Parallel
  — Graph analysis determines how to do it
  — Classic data parallelism: partitioning + pipelining

PLINQ “Gotchas”

• Ordering not guaranteed
    int[] data = new int[] { 0, 1, 2 };
    var q = from x in data.AsParallel(QueryOptions.PreserveOrdering)
            select x * 2;
    int[] scaled = q.ToArray(); // == { 0, 2, 4 }?
• Exceptions are aggregated
    object[] data = new object[] { "foo", null, null };
    var q = from x in data.AsParallel() select x.ToString();
  — NullRefExceptions on data[1], data[2], or both?
  — PLINQ will always aggregate under AggregateException
• Side effects and mutability are serious issues
  — Most queries do not use side effects… but:
      var q = from x in data.AsParallel() select x.f++;
  — OK if all elements in data are unique, else a data race
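The side-effect gotcha is not PLINQ-specific. Here is a minimal Java illustration of the same hazard (my own sketch, using parallel streams rather than PLINQ): unsynchronized mutation from data-parallel iterations is a data race, while an atomic accumulator restores a deterministic result.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class SideEffectRace {
    static int unsafeCount = 0;   // plain field: racy under parallel updates

    public static void main(String[] args) {
        IntStream.range(0, 100_000).parallel().forEach(i -> unsafeCount++);
        AtomicInteger safeCount = new AtomicInteger();
        IntStream.range(0, 100_000).parallel().forEach(i -> safeCount.incrementAndGet());
        // unsafeCount is frequently < 100000; safeCount is always 100000
        System.out.println(unsafeCount + " vs " + safeCount.get());
    }
}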

Intel Concurrent Collections (CnC)

The application problem
  ↓
Domain Expert (person): decomposition into serial steps — “Joe”
  — Only domain knowledge, no tuning knowledge
  — Decomposition constraints required by the application
  ↓
Concurrent Collections Program
  ↓
Tuning Expert (person, runtime, compiler): actual parallelism, locality, overhead, load balancing — “Stephanie”
  — No domain knowledge, only tuning knowledge
  — Distribution among processors; scheduling within a processor
  ↓
Mapping to target platform

Source: Kathleen Knobe, http://softwarecommunity.intel.com/articles/eng/3862.htm

Intel Concurrent Collections (contd)

[Figure: CnC graph for a cell-tracking application. Steps in parentheses: (Cell detector), (Cell tracker), (Arbitrator initial), (Correction filter), (Prediction filter), (Arbitrator final). Item collections in brackets: [Input Image], [Histograms], [Cell candidates], [Labeled cells initial], [State measurements], [Motion corrections], [Predicted states], [Labeled cells final], [Final states].]

• What are the high-level operations? (Steps)
• What are the chunks of data? (Item collections)
• What are the producer/consumer relationships?
• What are the inputs and outputs?

Make it precise enough to execute

[Figure: the same graph with tags added to every step and item collection, e.g. (Cell detector: K), [Input Image: K], [Histograms: K], [Cell candidates: K], (Cell tracker: K), [Labeled cells initial: K], (Arbitrator initial: K), [State measurements: K], (Correction filter: K, cell), [Motion corrections: K, cell], (Prediction filter: K, cell), [Predicted states: K, cell], [Labeled cells final: K], (Arbitrator final: K), [Final states: K].]

• How do we distinguish among the instances?

Make it precise enough to execute (contd)

[Figure: the same tagged graph repeated.]

• How do we distinguish among the instances? Tags
• What are the sets of instances?
• Who determines these sets? [Steps]

Source: Kathleen Knobe

Domain Expert’s view of Concurrent Collections

• No thinking about parallelism: only domain knowledge
• No overwriting: collections are single-assignment
• No arbitrary serialization: the only constraints on ordering come via tagged puts and gets
• Steps are functional: no side-effects
• Result is: deterministic, race-free

Comparison of Multicore Programming Models along Selected Dimensions

Model | Dynamic Parallelism | Locality Control | Mutual Exclusion | Collective & Point-to-point Synchronization | Data Parallelism
Cilk | spawn, sync | None | Locks | None | None
Java Concurrency | Executors, task queues | None | Locks, monitors, atomic classes | Synchronizers | Concurrent collections
Intel TBB | Generic algs, tasks | None | Locks, atomic classes | None | Concurrent containers
.Net Parallel Extensions | Generic algs, tasks | None | Locks, monitors | Futures | PLINQ
OpenMP | SPMD (v2.5), tasks (v3.0) | None | Locks, critical, atomic | Barriers | None
CUDA | None | Device, grid, block, threads | None | Barriers | SIMD
Intel Concurrent Collections | Tagged prescription of steps | None | None | Tagged put & get operations on Item Collections | None
X10 + Habanero | Async, finish, delayed async | Places | Atomic blocks, Java atomic operations/classes | Future, force, clocks, phasers, Java concurrent collections | SIMD/MIMD array extensions (builds on Java Concurrency)

Outline

• A Sample of Modern Multicore Programming Models

• X10 + Habanero Execution Model and Implementation Challenges

• Rice Habanero Multicore Software research project

X10 Acknowledgments

• X10 Core Team (IBM): Ganesh Bikshandi, Sreedhar Kodali, Nathaniel Nystrom, Igor Peshansky, Vijay Saraswat, Pradeep Varma, Sayantan Sur, Olivier Tardieu, Krishna Venkat, Tong Wen, Jose Castanos, Ankur Narang, Komondoor Raghavan
• X10 Tools: Philippe Charles, Robert Fuhrer
• Emeritus: Kemal Ebcioglu, Christian Grothoff, Vincent Cave, Lex Spoon, Christoph von Praun, Rajkishore Barik, Chris Donawa, Allan Kielstra
• Research colleagues: Vivek Sarkar (Rice U.); Satish Chandra, Guojing Cong; Ras Bodik, Guang Gao, Radha Jagadeesan, Jens Palsberg, Rodric Rabbah, Jan Vitek; Vinod Tipparaju, Jarek Nieplocha (PNNL); Kathy Yelick, Dan Bonachea (Berkeley); several others at IBM

Recent Publications
1. “Type Inference for Locality Analysis of Distributed Data Structures”, PPoPP 2008.
2. “Deadlock-free scheduling of X10 Computations with bounded resources”, SPAA 2007.
3. “A Theory of Memory Models”, PPoPP 2007.
4. “May-Happen-in-Parallel Analysis of X10 Programs”, PPoPP 2007.
5. “An annotation and compiler plug-in system for X10”, IBM Technical Report, Feb 2007.
6. “Experiences with an SMP Implementation for X10 based on the Java Concurrency Utilities”, Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006.
7. “An Experiment in Measuring the Productivity of Three Parallel Programming Languages”, P-PHEC workshop, February 2006.
8. “X10: An Object-Oriented Approach to Non-Uniform Cluster Computing”, OOPSLA conference, October 2005.
9. “Concurrent Clustered Programming”, CONCUR conference, August 2005.
10. “X10: an Experimental Language for High Productivity Programming of Scalable Systems”, P-PHEC workshop, February 2005.

Tutorials
• TiC 2006, PACT 2006, OOPSLA 2006, PPoPP 2007, SC 2007
• Graduate course on X10 at U Pisa (07/07)
• Graduate course at Waseda U (Tokyo, 04/08)

X10 + Habanero Execution Model: Portable Parallelism in Four Dimensions

1. Lightweight dynamic task creation & termination
   • async, finish (from X10)
2. Locality control: task and data distributions
   • places (from X10)
3. Mutual exclusion
   • atomic (from X10 & related work on transactions)
4. Collective and point-to-point synchronization
   • phasers (from Habanero, an extension of X10 clocks)

Async and Finish

Stmt ::= async Stmt
Stmt ::= finish Stmt

async S
• Creates a new child activity that executes statement S
• Returns immediately
• S may reference final variables in enclosing blocks
• Activities cannot be named
• Activity cannot be aborted or cancelled

finish S
• Execute S, but wait until all (transitively) spawned asyncs have terminated
• Rooted exception model: traps all exceptions thrown by spawned activities; throws an (aggregate) exception if any spawned async terminates abruptly
• Implicit finish between start and end of main program
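The counting discipline behind finish can be approximated in plain Java. The following is a minimal sketch of my own (FinishScope, async, and awaitAll are invented names, not the X10 runtime API): a scope-wide task counter ensures the join releases only after all children, including children spawned by children on the same scope, have terminated. The rooted exception model is omitted for brevity.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

public class FinishScope {
    private final ExecutorService pool;
    private final AtomicInteger pending = new AtomicInteger(1); // the finish body itself
    private final CountDownLatch done = new CountDownLatch(1);

    public FinishScope(ExecutorService pool) { this.pool = pool; }

    public void async(Runnable s) {              // spawn a child activity
        pending.incrementAndGet();
        pool.execute(() -> { try { s.run(); } finally { leave(); } });
    }

    public void awaitAll() throws InterruptedException {  // end of the finish scope
        leave();                                 // the finish body is done
        done.await();                            // block until all asyncs terminate
    }

    private void leave() {
        if (pending.decrementAndGet() == 0) done.countDown();
    }
}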

Recursive Parallelism with Finish and Async

. . . // Main program
finish refine(root, 1, nmax);

From “What’s in it for the Users? Looking Toward the HPCS Languages and Beyond”, D. Bernholdt, W.R. Elwasif, Robert J. Harrison, PGAS 2006

Loop Parallelism with Finish and Async

int iters = 0;
delta = epsilon+1;
while ( delta > epsilon ) {
  finish {
    for ( jj = 1 ; jj <= n ; jj++ ) {
      final int j = jj;
      async { // for-async can be replaced by foreach
        newA[j] = (oldA[j-1]+oldA[j+1])/2.0f ;
        diff[j] = Math.abs(newA[j]-oldA[j]);
      } // async
    } // for
  } // finish
  delta = diff.sum();
  iters++;
  temp = newA; newA = oldA; oldA = temp;
}
System.out.println("Iterations: " + iters);

Loop Parallelism with Foreach

int iters = 0;
delta = epsilon+1;
while ( delta > epsilon ) {
  finish {
    foreach ( point[j] : [1:n] ) {
      newA[j] = (oldA[j-1]+oldA[j+1])/2.0f ;
      diff[j] = Math.abs(newA[j]-oldA[j]);
    } // foreach
  } // finish
  delta = diff.sum();
  iters++;
  temp = newA; newA = oldA; oldA = temp;
}
System.out.println("Iterations: " + iters);
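For comparison, here is a rough plain-Java analogue of one sweep of the foreach loop above (my sketch; X10's foreach additionally integrates with places, exceptions, and phasers): the terminal operation of a parallel stream plays the role of the enclosing finish.

import java.util.stream.IntStream;

public class JacobiStep {
    // One parallel sweep over j = 1..n; each iteration is independent.
    static void step(float[] oldA, float[] newA, float[] diff, int n) {
        IntStream.rangeClosed(1, n).parallel().forEach(j -> {
            newA[j] = (oldA[j - 1] + oldA[j + 1]) / 2.0f;
            diff[j] = Math.abs(newA[j] - oldA[j]);
        }); // returns only after every iteration completes, like finish
    }
}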

Implementation Challenges for Finish & Async

• Extend space-efficient scalable work-stealing schedulers to support terminally strict finish-async programs
  — [PPoPP 2007, ICPP 2008]
• Reduce footprint impact of inflated blocked activities
  — Delayed asyncs (work in progress)

Terminally Strict Computations

[Figure: computation dag with continue, spawn, and dependence edges. Activity A0 is split into three parts by a start-finish/end-finish scope: A0 spawns A1 (which spawns A2) before the finish, and spawns A3 and A4 inside the finish.]

// X10 pseudo code
main() { // implicit finish
  Activity A0 (Part 1);
  async { A1; async A2; }
  try {
    finish {
      Activity A0 (Part 2);
      async A3;
      async A4;
    }
  } catch (…) { … }
  Activity A0 (Part 3);
}

“Deadlock-Free Scheduling of X10 Computations with Bounded Resources”, S. Agarwal et al, SPAA 2007.
Theorem 2.6: A work-stealing execution of a terminally strict multithreaded computation with finish & async constructs on P processors uses at most S1*P space in its deques, where S1 is the maximum stack depth in a sequential execution of the program.

Delayed Asyncs

Stmt ::= async await( Cond ) Stmt

• Creates a new child activity for statement Stmt
• Returns immediately
• The new activity can only start execution of Stmt when Cond becomes true
  — Activity does not get inflated into an active task before Cond becomes true

Example: consider the case when Stmt contains a blocking operation such as a CnC get operation, e.g.,

  S1;
  C.get(…); // May block
  S2;

“async Stmt” can be translated to a delayed async as follows:

  async {
    S1;
    async await( C.containsTag(…) ) S2;
  }
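A plain-Java way to get the same “no inflation until the condition holds” effect is to register a continuation instead of blocking a thread. A sketch of my own using CompletableFuture (a later JDK facility, not the Habanero runtime):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DelayedAsyncDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletableFuture<Void> cond = new CompletableFuture<>();
        // Register S2 without occupying a worker: nothing is inflated yet.
        cond.thenRunAsync(() -> System.out.println("S2 runs after cond"), pool);
        System.out.println("S1 runs immediately");
        cond.complete(null);  // Cond becomes true; only now does S2 become a task
        pool.shutdown();
    }
}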

Mapping of Concurrent Collections to X10 + Habanero

CnC Construct | Translation to X10 + Habanero
Tag | Point object
Prescription | Async or delayed async
Item Collection | java.util.concurrent.ConcurrentHashMap
put() on Item Collection | Nonblocking put() on ConcurrentHashMap
get() on Item Collection | Blocking or nonblocking get() on ConcurrentHashMap

“Multicore Implementations of the Concurrent Collections Programming Model”, Zoran Budimlic, Aparna Chandramowlishwaran, Kathleen Knobe, Geof Lowney, Vivek Sarkar, Leo Treggiari (work in progress)
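To make the last two table rows concrete, here is one possible Java realization of an item collection (my sketch, not the Habanero implementation; it layers CompletableFuture cells over a ConcurrentHashMap to provide a nonblocking put, a blocking get, and the single-assignment rule):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class ItemCollection<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> items = new ConcurrentHashMap<>();

    private CompletableFuture<V> cell(K tag) {
        return items.computeIfAbsent(tag, t -> new CompletableFuture<>());
    }

    public void put(K tag, V value) {            // nonblocking put
        if (!cell(tag).complete(value))
            throw new IllegalStateException("single assignment violated: " + tag);
    }

    public V get(K tag) {                        // blocking get
        return cell(tag).join();
    }

    public boolean containsTag(K tag) {          // nonblocking availability test
        CompletableFuture<V> c = items.get(tag);
        return c != null && c.isDone();
    }
}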

X10 + Habanero Execution Model: Portable Parallelism in Four Dimensions

1. Lightweight dynamic task creation & termination
   • async, finish (from X10)
2. Locality control: task and data distributions
   • places (from X10)
3. Mutual exclusion
   • atomic (from X10 & related work on transactions)
4. Collective and point-to-point synchronization
   • phasers (from Habanero, an extension of X10 clocks)

2. Task and data distributions with Places

Storage classes:
• Activity-local
• Place-local
• Partitioned global
• Immutable

• Dynamic parallelism with a Partitioned Global Address Space
• Places encapsulate binding of activities and globally addressable mutable data
• Number of places currently fixed at launch time
• Each datum has a designated place specified by its distribution
• Each async has a designated place specified by its distribution --- subsumes threads, structured parallelism, messaging, DMA transfers, etc.
• Immutable data (value types, value arrays) is place-independent and offers an opportunity for functional-style parallelism
• Type inference for places --- “Type Inference for Locality Analysis of Distributed Data Structures”, S. Chandra et al, PPoPP 2008.

Extension of Async with Places

Examples

1) finish { // Inter-place parallelism
     final int x = … , y = … ;
     async (a) a.foo(x);                       // Execute at a’s place
     async (b.distribution[i]) b[i].bar(y);    // Execute at b[i]’s place
   }

2) // Implicit and explicit versions of remote fetch-and-op
   a) a.x = foo(a.x, b.y) ;
   b) async (b) {
        final double v = b.y; // Can be any value type
        async (a) atomic a.x = foo(a.x, v);
      }
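A toy Java model of places (my own construction; real X10 places also partition the heap and support data distributions): give each place one single-threaded executor and pin each activity to the executor of the place it names. Because each place runs one task at a time, nesting asyncAt calls reproduces the two-hop fetch-and-op in example 2(b), with the inner hop serialized at a's place much like the atomic block.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Places {
    private final ExecutorService[] places;

    public Places(int n) {
        places = new ExecutorService[n];
        for (int i = 0; i < n; i++)
            places[i] = Executors.newSingleThreadExecutor(); // one worker per place
    }

    // async (p) S: run S on the worker that owns place p's data
    public Future<?> asyncAt(int place, Runnable s) {
        return places[place].submit(s);
    }
}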

Portable Parallelism via Place Deployments

Data Structures and Activities
  ↓ program defines mapping from objects & activities to virtual places
Places
  ↓ deployment defines mapping from virtual places to physical processing elements
Physical PEs

[Figure: example deployment targets — homogeneous multicore (SMP nodes with PEs, L1/L2 caches, and memory), heterogeneous accelerators (Cell: a PPE plus eight SPEs with local stores, connected by the EIB), and clusters of SMP nodes.]

Place Deployment on a Multicore SMP

• Basic approach: partition the heap into multiple place-local heaps
  — Each object is allocated in a designated place
  — Each activity is created (and pinned) at a designated place
  — Allow an activity to synchronously access data at remote places outside of atomic sections
• Thus, places serve as affinity hints for intra-SMP locality

[Figure: heap partitioned into Place 0 | Place 1 | Place 2 | Place 3]

Possible Place Deployment for Cell

• Basic approach:
  — Map 9 places onto the PPE + eight SPEs
  — Use finish & asyncs as a high-level representation of DMAs

[Figure: Cell processor with Place 0 on the PPE and Places 1–8 on the eight SPUs]

• Implementation challenges:
  — Weak PPE
  — SIMDization critical for performance
  — Lack of hardware support for coherence
  — Limited memory on SPEs
  — Different ISAs for PPE and SPE

X10 + Habanero Execution Model: Portable Parallelism in Four Dimensions

1. Lightweight dynamic task creation & termination
   • async, finish (from X10)
2. Locality control: task and data distributions
   • places (from X10)
3. Mutual exclusion
   • atomic (from X10 & related work on transactions)
4. Collective and point-to-point synchronization
   • phasers (from Habanero, an extension of X10 clocks)

Overview of Phasers

• Extension of X10 clocks
• Designed to handle multiple communication patterns
  — Collective barrier
  — Point-to-point synchronization
• Phase ordering property
• Deadlock freedom in the absence of explicit wait operations
• Dynamic parallelism
  — # activities synchronized on a phaser can vary dynamically
• Amenable to efficient implementation
  — Lightweight local-spin multicore implementation in the Habanero project
  — Phasers recently incorporated by Doug Lea into JSR-166y for possible inclusion in Java 7 (google “phasers java”)

“Phasers: a Unified Deadlock-Free Construct for Collective and Point-to-point Synchronization”, J. Shirako, D. Peixotto, V. Sarkar, W. Scherer, ICS 2008
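Since the slide notes that phasers were adopted into JSR-166y, here is a small example against the java.util.concurrent.Phaser that eventually shipped in Java 7. It is a JDK analogue rather than Habanero's construct (registration modes and next-with-single are absent), but dynamic registration and barrier phases carry over directly.

import java.util.concurrent.Phaser;

public class PhaserDemo {
    public static void main(String[] args) {
        final int n = 4;
        final Phaser ph = new Phaser(1);        // register the main activity
        for (int t = 0; t < n; t++) {
            ph.register();                      // dynamic parallelism: add one party
            final int id = t;
            new Thread(() -> {
                System.out.println("task " + id + " in phase " + ph.getPhase());
                ph.arriveAndAwaitAdvance();     // collective barrier, like next
                System.out.println("task " + id + " in phase " + ph.getPhase());
                ph.arriveAndDeregister();       // drop out when finished
            }).start();
        }
        ph.arriveAndDeregister();               // main stops participating
    }
}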

Collective and Point-to-point Synchronization with Phasers

phaser ph = new phaser(MODE);
• Allocate a phaser and register the current activity with it according to MODE. Phase 0 of ph starts.
• MODE can be SIGNAL_ONLY, WAIT_ONLY, SIGNAL_WAIT (default), or SINGLE
• Finish scope rule: phaser ph cannot be used outside the scope of its immediately enclosing finish operation

async phased (MODE1(ph1), MODE2(ph2), …) S
• Spawn S as an asynchronous (parallel) activity that is registered on phasers ph1, ph2, … according to MODE1, MODE2, …
• Capability rule: a parent activity can only transmit phaser capabilities to a child activity that are a subset of the parent’s capabilities, according to the lattice:

            SINGLE
              |
         SIGNAL_WAIT
         /          \
  SIGNAL_ONLY    WAIT_ONLY

Phaser Primitives (contd)

next;
• Advance each phaser that the activity is registered on to its next phase; semantics depend on the registration mode

next { SINGLE-STMT } // next-with-single statement
• Execute the next operation with a single instance of SINGLE-STMT executed during the phase transition
• All activities executing a next-with-single statement must execute the same static statement

ph.signal();
• Nonblocking operation that signals completion of work by the current activity for this phase of phaser ph
• Error if the activity does not have signal capability

signal;
• Perform ph.signal() on each phaser that the activity is registered on with a signal capability

next-with-single statement

• Single-activity region executed during phase transition
  — next { SINGLE-STMT }
• Available for SINGLE mode
• SINGLE-STMT is executed by one (master) activity
  — Like the “barrier action” in j.u.c’s CyclicBarrier
• All activities executing a next-with-single statement must execute the same static statement
• Master is selected with the following priority: sig-wait-next > sig-wait > wait-only

Example of next-with-single statement & Split-phase Barrier

finish {
  phaser ph = new phaser(SINGLE);
  for (int i = 0; i < n; i++)
    async phased( ph ) {
      a[i] = foo(…);
      signal;               // split phase: signal early, then keep working
      b[i] = F(a[i]);       // LOCAL WORK overlapped with the phase transition
      …
      next {                // next-with-single: one master runs the statement
        for (int j = 0; j < n; j++) sum += a[j];
      }
      …
    }
}

[Diagram: each activity computes a[i] and signals, then does local work; all activities meet at next, where the master alone executes the single statement summing a[j].]

Example of Pipeline Parallelism with Phasers

finish {
  phaser [] ph = new phaser[m+1];
  for (int i = 1; i < m; i++)
    async phased (ph[i], ph[i-1]) {   // signals ph[i], waits on ph[i-1]
      for (int j = 1; j < n; j++) {
        a[i][j] = foo(a[i][j], a[i][j-1], a[i-1][j-1]);
        next;
      } // for
    } // async
} // finish

[Diagram: four pipeline stages registered as sig(ph[i]) and wait(ph[i-1]); iterations (i, j) advance in lockstep through successive next operations, so stage i+1 processes column j while stage i processes column j+1.]

Implementation Challenges for Phasers

• Efficient performance (especially in the context of a JVM managed runtime)
• Support for dynamic parallelism
• Support for single statements
• Support for split-phase barriers
• Extension to reductions & collectives (in progress)

Performance Results for Phasers

• Benchmarks
  — BarrierBench microbenchmark
  — Java Grande Forum Benchmarks: LUFact, MolDyn, SOR
  — NAS Parallel Benchmarks: CG, MG
  — java.util.concurrent: BarrierJacobi
• Implementations used for comparison
  — X10 with clocks (unoptimized reference implementation)
  — Java threads (hand-optimized synchronization)
  — X10 (Habanero) with phasers (fixed vs. unfixed master)
  — OpenMP (only for BarrierBench microbenchmark)

Barrier Overhead on 64-way Power5+ SMP: EPCC Java BarrierBench Microbenchmark

Barrier overhead compared to Phaser (fixed master) with 64 threads: X10 clocks 134.5x, Java notify & wait 87.8x, j.u.c. CyclicBarrier 60.4x

Barrier Overhead on 64-way Power5+ SMP: EPCC Barrier Microbenchmark (contd)

• Barrier overhead of Phaser (fixed master) compared to OpenMP
  — 2 to 32 threads: Phaser’s overhead is lower
  — 64 threads: OpenMP’s overhead is lower

Speedup on 64-way Power5+ SMP: Java Grande Benchmarks & NAS Parallel Benchmarks

Average speedup with phasers (fixed master): 3.19x faster than X10 clocks, 2.08x faster than Java threads

Outline

• A Sample of Modern Multicore Programming Models

• X10 + Habanero Execution Model and Implementation Challenges

• Rice Habanero Multicore Software research project

Habanero Acknowledgments

• Faculty: Vivek Sarkar, Bill Scherer
• Research Scientists: Zoran Budimlic, Chuck Koelbel
• Postdocs: Jun Shirako, Yonghong Yan
• Graduate Students: Rajkishore Barik, Yi Guo, Mack Joyner, David Peixotto, Raghavan Raman, …
• Other collaborators at Rice: Keith Cooper, Tim Harvey, John Mellor-Crummey, Krishna Palem, Linda Torczon, Anna Youssefi, Rui Zhang

Significant Experience at Rice to Tackle Multicore Software Implementation Challenges

• Languages
  — Co-Array Fortran, X10
  — Sisal, Fortran 90, HPF, Matlab, LabView
• Static Optimizing Compilers
  — JaMake (Java optimization)
  — D System, PTRAN, ASTI (Fortran optimization, automatic parallelization, locality & communication optimizations)
  — Massively Scalar, Trimaran (optimizing back ends)
• Dynamic Optimizing Compilers & Runtimes
  — Jikes Research Virtual Machine for Java
  — Java Concurrency Utilities
• Parallel Tools
  — HPCToolkit
• Applications Expertise
  — Computer Graphics and Visualization
  — Compressive Sensing
  — Seismic Analysis
  — DoE/SciDAC applications

Habanero Project Overview (habanero.rice.edu)

Parallel Applications (Seismic analysis, Medical imaging, 3-D Graphics, …)

Habanero software stack:
1) Habanero Programming Model (X10 + Habanero, CnC, …), with a foreign function interface to sequential C, Fortran, Java, …
2) Habanero Static Compiler
3) Habanero Virtual Machine
4) Habanero Concurrency Library

X10 + Habanero Execution Model:
1. Task creation & termination
   • async, finish, delayed async
2. Task and data distributions
   • places
3. Mutual exclusion
   • atomic
4. Collective and point-to-point synchronization
   • phasers

Multicore Platforms (Cell, Clearspeed, Cyclops, Opteron, Power, Tesla, UltraSparc, Xeon, …)

2) Habanero Static Parallelizing & Optimizing Compiler

[Figure: compiler pipeline. X10/Habanero front end, with a foreign function interface to sequential C, Fortran, Java, … → AST → interprocedural analysis → IRGen → Parallel IR (PIR, from IBM & Rice) → PIR analysis & optimization → classfile transformations producing annotated classfiles for the portable managed runtime, plus partitioned code (restricted code regions for targeting accelerators & high-end computing) emitted as C / Fortran through a platform-specific static compiler.]

Three Levels in PIR

• Level 1 is a hierarchical parallel IR with parallel constructs expressed as new nested regions
  — Example optimization: foreach chunking with phaser operations
• Level 2 is a flattened parallel IR with new operators for parallel constructs
  — Example optimization: load elimination
• Level 3 is a sequential IR with runtime calls for concurrency constructs
  — Example optimization: optimize frame operations in runtime calls for work-stealing scheduling

Conclusion

?

[Figure: the homogeneous multicore, heterogeneous accelerator (Cell), and high-performance cluster targets shown earlier, repeated under a question mark.]

Advances in parallel languages, compilers, and runtimes are necessary to address the implementation challenges of multicore programming.

Acknowledgments

• Charles Leiserson (Cilk slides)
• Doug Lea (j.u.c slides)
• Arch Robison (TBB slides)
• Joe Duffy, Igor Ostrovsky (TPL & PLINQ slides)
• Kathleen Knobe (Intel Concurrent Collections slides)
• IBM X10 team
• Rice Habanero team

Reminder: PPoPP 2009 deadline is fast approaching …
