Scheduling parallel programs with work stealing

Vincent Danjean, [email protected]

MOAIS project, INRIA Grenoble Rhône-Alpes
Seminar at UFRGS, Porto Alegre, Brazil

Moais Positioning

[Diagram: KAAPI at the center, targeting Grid, Cluster, Multicore, MPSoC, and GPU platforms]

To mutually adapt application and scheduling

MOAIS group

• Funded by INRIA, CNRS, UJF, INPG
• Part of LIG (Laboratoire d'Informatique de Grenoble): http://moais.imag.fr
• Jean-Louis Roch: project leader
• 10 researchers
• 18 PhD students

Research actions

• Scheduling [Denis Trystram]
  - multi-objective criteria, malleable and moldable models
• Parallel algorithms [Jean-Louis Roch]
  - adaptive algorithms
• Virtual reality [Bruno Raffin]
  - interactive simulation
• Runtime for HPC [Thierry Gautier]
  - grid and cluster, multi-processor

Outline

• Athapascan / Kaapi: a software stack
• Foundations of work stealing
• Controlling the overheads
• Experiments with STL algorithms

Goal

• Write once, run anywhere... with guaranteed performance

• Problem: heterogeneity
  - variations of the environment (#cores, speed, failures...)
  - irregular computation

KAAPI Overview

[Diagram, three layers: the Application on top, the KAAPI middleware in
the middle, the parallel architecture at the bottom. The model (an
abstract representation capturing the "causal connections" of the
application) links application and middleware; the algorithms
(scheduling, fault tolerance protocols, ...) map that representation
onto the architecture to deliver performance.]

API: Athapascan

• Global address space
  - creation of objects in a global address space with the 'Shared' keyword
• Task = function call
  - creation with the 'Fork' keyword (≠ Unix fork, ~ spawn)
  - tasks only communicate through shared objects
  - a task declares its access mode (read, write, concurrent write, exclusive) to shared objects
• Automatic scheduling
  - work stealing or graph partitioning
➡ 'Sequential' semantics
• C++ library, not a language extension
  - a C language extension was also prototyped

Fibonacci example

    struct Sum {
      // Sum is defined first so that Fork<Sum> is visible inside Fibonacci.
      void operator()( a1::Shared_w<int> result,
                       a1::Shared_r<int> sr1, a1::Shared_r<int> sr2 ) {
        result.write( sr1.read() + sr2.read() );
      }
    };

    struct Fibonacci {
      void operator()( int n, a1::Shared_w<int> result ) {
        if (n < 2) result.write( n );
        else {
          a1::Shared<int> subresult1;
          a1::Shared<int> subresult2;
          a1::Fork<Fibonacci>()( n-1, subresult1 );
          a1::Fork<Fibonacci>()( n-2, subresult2 );
          a1::Fork<Sum>()( result, subresult1, subresult2 );
        }
      }
    };

The same example, annotated: the forked Fibonacci tasks write
subresult1 and subresult2, and the Sum task reads them, so the Fork
calls induce write -> read dependencies in the data flow graph.

Semantics & C++ Elision

Erasing the Athapascan keywords from the program above (Shared objects
become plain variables, Fork'ed tasks become direct function calls)
yields a valid sequential C++ program with the same semantics.
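As an illustration, a minimal sketch of the elided program; the exact
erased signatures (int by reference) are an assumption, not something
the library generates:

    // Hypothetical sequential elision of the Fibonacci example:
    // Shared<int> erases to int, Shared_w / Shared_r to int& / const int&,
    // and a1::Fork<F>()(args) to a direct call F()(args).
    struct Sum {
      void operator()( int& result, const int& sr1, const int& sr2 ) {
        result = sr1 + sr2;
      }
    };

    struct Fibonacci {
      void operator()( int n, int& result ) {
        if (n < 2) result = n;
        else {
          int subresult1, subresult2;
          Fibonacci()( n-1, subresult1 );
          Fibonacci()( n-2, subresult2 );
          Sum()( result, subresult1, subresult2 );
        }
      }
    };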

Online construction

[Animation: the data flow graph is built online, over time. The initial
Fibonacci task (writing result) unfolds into two child Fibonacci tasks
(writing subresult1 and subresult2) and a Sum task that combines them
into result; each child unfolds recursively in the same way.]

Data Flow Graph Interests

One abstract representation for:
• Scheduling
  - data flow graph ⇒ precedence graph ⇒ dependence graph
  - classical numerical kernels ⇒ partitioning of the dependence graph
• Explicit data transfers
  - the data to be transferred is explicit in the data flow graph
  - automatic overlapping of communication with computation
• Original fault tolerance protocols
  - TIC [IET05, Europar05, TDSC09]: couples work-stealing scheduling with the abstract representation
  - CCK [TSI07, MCO08]: coordinated checkpointing with partial restart after failure
  - sabotage tolerance [PDP09]: dynamically adapts the execution to sabotage
• Drawback: the complexity to manage it

Stack management

• Stack-based allocation
  - tasks and accesses to shared data are pushed onto a stack
  - close to the management of C function calls
  - O(1) allocation time
  - O(#parameters) initialization time

    a1::Shared<int> subresult1;                         | stack: Shared sr1
    a1::Shared<int> subresult2;                         |        Shared sr2
    a1::Fork<Fibonacci>()( n-1, subresult1 );           |        Fibonacci, sr1
    a1::Fork<Fibonacci>()( n-2, subresult2 );           |        Fibonacci, sr2
    a1::Fork<Sum>()( result, subresult1, subresult2 );  |        Sum, r, sr1, sr2

Stack growth
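To make the O(1) / O(#parameters) claims concrete, here is a minimal
sketch of a bump-pointer task stack; the Task record and all names are
hypothetical, not the actual KAAPI data structures:

    #include <cstddef>
    #include <cstdint>
    #include <new>

    // Hypothetical task record; the real KAAPI layout differs.
    struct Task {
      void (*body)(void** args);
      std::size_t nargs;
    };

    // Bump-pointer stack: O(1) allocation, freed wholesale when the
    // enclosing task completes, like a C call frame. Alignment handling
    // is omitted for brevity.
    class TaskStack {
      std::uint8_t* sp;      // current top of the stack
      std::uint8_t* limit;   // end of the reserved region
    public:
      TaskStack(void* base, std::size_t size)
        : sp(static_cast<std::uint8_t*>(base)),
          limit(static_cast<std::uint8_t*>(base) + size) {}

      // Push a task record followed by room for its argument pointers:
      // O(1) for the record, O(#parameters) to fill in the accesses.
      Task* push(void (*body)(void**), std::size_t nargs) {
        std::size_t bytes = sizeof(Task) + nargs * sizeof(void*);
        if (sp + bytes > limit) return nullptr;  // overflow: caller handles
        Task* t = new (sp) Task{body, nargs};
        sp += bytes;
        return t;
      }
    };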

Execution model

• A (dynamic) set of UNIX processes
  - communication by active messages
  - each process is multithreaded
  - the root process forks the main task
• Threads inside a process are either
  - performing work: they execute tasks
  - or idle and participating in the scheduling: they then try to steal work (tasks) from other threads (the work-stealing algorithm)

2-Level Scheduling

[Diagram: two-level scheduling. Inside each process, the KAAPI
scheduler maps tasks onto K-processors (kernel threads, possibly
idle); the OS scheduler maps these onto the CPUs. Processes
communicate by active messages over TCP/IP, Myrinet, or SSH.]

Outline

• Athapascan / Kaapi: a software stack
• Foundations of work stealing
• Controlling the overheads
• Experiments with STL algorithms

Principle of work stealing

[Animation: one working thread executes its tasks while two idle
threads wait. An idle thread sends a steal request to the working
thread, extracts part of its work, executes it, and returns the
result.]

Work stealing queue

• Task: the basic unit of computation
• Each thread has its own work queue
  - push(task): push a task (used by the owner thread)
  - pop() -> task: pop a task (used by the owner thread)
  - push / pop: LIFO order
• An idle thread can steal a task from a victim thread's work queue
  - steal() -> task: steal a task from the queue (used by the thief thread)
  - push / steal: FIFO order
• Random selection of the victim thread (see the sketch below)
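A minimal sketch of these operations and of the scheduling loop, using
a plain lock so the example stays self-contained (the real queues are
lock-free; see the workqueue slides later):

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <random>
    #include <vector>

    using Task = std::function<void()>;

    // A trivially locked queue, only to keep the sketch self-contained.
    struct WorkQueue {
      std::deque<Task> d;
      std::mutex m;
      void push(Task t) {                    // owner: LIFO end
        std::lock_guard<std::mutex> g(m);
        d.push_back(std::move(t));
      }
      bool pop(Task& t) {                    // owner: LIFO end
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return false;
        t = std::move(d.back()); d.pop_back();
        return true;
      }
      bool steal(Task& t) {                  // thief: FIFO end
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return false;
        t = std::move(d.front()); d.pop_front();
        return true;
      }
    };

    // Scheduling loop of one thread: run own tasks, otherwise steal
    // from a random victim (termination detection omitted).
    void worker_loop(std::vector<WorkQueue>& queues, std::size_t self) {
      std::mt19937 rng(static_cast<unsigned>(self));
      std::uniform_int_distribution<std::size_t> pick(0, queues.size() - 1);
      for (;;) {
        Task t;
        if (queues[self].pop(t)) { t(); continue; }  // own work first
        std::size_t victim = pick(rng);              // random victim
        if (victim != self && queues[victim].steal(t)) t();
      }
    }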

Assumptions / Definitions

• Fork/join recursive parallel programs
  - (formally: strict multithreaded computations)
• Notation:
  - Tseq : the "sequential work"

- T1 : execution time on 1 core of the parallel program, called “work”

- T∞ : execution time on ∞ cores, called the "critical path" or the depth

- Tp : execution time on P cores
- P : the number of processors

Remarks

• P∞ = T1 / T∞ (by definition)
  - the available parallelism (maximum speedup)
  - e.g. Fibonacci(30) = 832040
    - T1 = 5.285×10⁻² s, T∞ = 3.08×10⁻⁷ s
    - P∞ ≈ 171591
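As a quick check, the available parallelism is just the ratio of the
two measured times:

    P_\infty \;=\; \frac{T_1}{T_\infty}
             \;=\; \frac{5.285 \times 10^{-2}\,\mathrm{s}}{3.08 \times 10^{-7}\,\mathrm{s}}
             \;\approx\; 1.716 \times 10^{5} \;\approx\; 171\,591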

• T1 / Tseq
  - measures the work overhead:
    - the creation of tasks
    - the work queue management (push/pop)

Theoretical bound

- Using a work-stealing scheduler with random selection of the victim, the expected execution time on P processors is:

Tp = O(Tseq / P + T∞)

- The expected number of steal requests per thread Xp is:

Xp = O(T∞)

- [Blumofe, Leiserson, FOCS 94, PPoPP 95], [Arora, Blumofe, Plaxton, SPAA 98], ...

With a data flow graph?

• The previous bounds assume "fork/join" programs
  - all created tasks are ready
• This is not true in a data flow program
  - complexity of computing the data flow constraints

• Fortunately, the same bound holds
  - [Galilée, Doreille, Cavalheiro, Roch, PACT 98]
  - [Gautier, Roch, Wagner, ICCS 2007]
  - in the general case, the sequential execution is a valid execution
    - so the data flow constraints need not be computed, except on rare events (steals)

Remarks

• Other interesting results
  - space efficient
  - the "same" bounds hold for heterogeneous machines (dynamic speeds) with a slightly modified work stealing [Bender, Rabin, SPAA00]:
    Tp = O(Tseq / (P·Πavg) + β·T∞ / Πavg), where Πavg is the average speed
• If P << P∞ ➡ quasi-linear speedup: Tp ~ c1·Tseq / P

- Interest of fine-grain parallel algorithms (T∞ << T1) ➡ P∞ >> 1, which allows quasi-linear speedup over a wide range of P

Practical bound

• Cilk 95
  - c1 ~ 1 to 25, depending on the application
  - c∞ ~ 1
  - Tp ≈ c1·Tseq / P + c∞·T∞
• Athapascan/Kaapi
  - 1998: T1/Tseq ~ 1 to 10000: bad representation!
  - 2004: T1/Tseq ~ 1 to 1000: optimizations
  - 2006: T1/Tseq ~ 1 to 100: stack representation
  - 2008: T1/Tseq ~ 1 to 20: data flow constraints computed only during steal operations

Related work

• Cilk [94, 95, 98]
  - one of the first languages: 3 keywords (cilk_spawn, cilk_sync, cilk_for)
  - sequential semantics!
  - theoretical guaranteed performance
  - shared memory
• Tascell [09]
  - indolent closure creation (on demand)
  - distributed
  - uses the same idea as a Cilk extension [96] that was never tested
• X10 [04, 08]
  - experimental language for HPC
  - extends the class of parallel programs amenable to a work-stealing strategy
• Satin [01], Capsule [06]

Related work (cont.)

• Athapascan/Kaapi [99, 03, 05, 07]
  - macro data flow: 2 keywords (Fork, Shared)
  - sequential semantics!
  - theoretical guaranteed performance
  - shared & distributed memory
  - fault tolerance protocols
  - fine-grain parallelism
  - adaptive algorithms [TSI05 Fr, Europar06, Europar08]
  - cooperative work stealing [09 Fr]

Outline

• Athapascan / Kaapi: a software stack
• Foundations of work stealing
• Controlling the overheads
• Experiments with STL algorithms

How to reduce T1/Tseq?

• Why? => WORK overhead
  - extra instructions compared to the sequential program
  - especially costly for short computations (-> STL algorithms)
• Three techniques:
  1. adapt the grain size: stop parallelism below a threshold
     - but: increases T∞, reduces the average parallelism and increases the number of steal requests
     - difficult to adjust automatically
  2. reduce the cost of creating a task
     - ...ideally, do not create tasks at all!
  3. optimize the cost of the work queue operations
     - difficult due to concurrent operations

Cost of task

• Cost = creation + extra arithmetic work
• Example: prefix computation
  - input: {a_i}, i=0..n; output: p_i = ∏_{k=0..i} a_k, ∀ i=0..n
  - sequential work: Tseq = n
  - parallel program => Ladner-Fischer (divide & conquer): T∞ = 2·log2 n, T1 = 2n
  - Fich's lower bound: any parallel algorithm with critical path log2 n requires at least 4n operations

Adaptive algorithm

• [Roch, Traoré 07], [Roch, Traoré, Gautier 08]
  - Principle: create tasks only when a processor is idle!
• Task = apply F(a_i) to all elements a_i of an array

[Animation: thread 1 processes the array sequentially; when thread 2
becomes idle and steals, the remaining range is split into an updated
task for thread 1 and a new task for thread 2.]

Adaptive algorithm: heterogeneous speeds

• Thread 2 is slower or has more work to do
• We always use preemption: the main thread (running the sequential algorithm) preempts the thieves

[Animation: the range is split again as threads become idle; when
thread 1 exhausts its own part, it preempts thread 2 and continues
with an updated task covering the work thread 2 had left.]

Algorithmic point of view

• 2 algorithms
  - sequential: for efficiency
  - parallel: to split the work & merge partial results
• Scheduler
  - interleaves the execution of the two algorithms depending on the idle CPUs (see the sketch below)
  - robust to heterogeneous processors
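A minimal sketch of the idea on an array range, with deliberately
simplified synchronization (illustrative names, not the KAAPI API; a
real implementation must order the split against the owner's
extraction):

    #include <atomic>
    #include <cstddef>

    // The remaining work, shared between the owner and potential thieves.
    struct Range {
      std::atomic<std::size_t> first{0};
      std::atomic<std::size_t> last{0};
    };

    // Owner side ("nano-loop"): the plain sequential algorithm; the only
    // per-element overhead is advancing/checking the shared bounds.
    template <class F>
    void owner_run(Range& r, F f) {
      std::size_t i;
      while ((i = r.first.fetch_add(1)) < r.last.load())
        f(i);  // sequential work on element i
    }

    // Steal side, executed on a steal request: the thief takes the
    // upper half of what remains; the owner keeps [first, mid).
    inline bool split(Range& r, Range& stolen) {
      std::size_t lo = r.first.load();
      std::size_t hi = r.last.load();
      if (hi < lo + 2) return false;   // too little work left to split
      std::size_t mid = lo + (hi - lo) / 2;
      r.last.store(mid);               // shrink the owner's range first
      stolen.first.store(mid);         // the thief gets [mid, hi)
      stolen.last.store(hi);
      return true;
    }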

Workqueue optimization

• 3 operations: push / pop / steal
• Algorithms
  - Cilk: T.H.E. protocol
    - serialization of thieves on the same victim
    - thief/victim atomic reads/writes, plus a lock in rare cases
  - ABP [SPAA00]
    - lock free (compare&swap), but prone to overflow
  - Chase & Lev [SPAA05]
    - no limitation (other than hardware)
➡ a COSTLY 'CAS' operation [PPoPP09]
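A sketch of the steal path of such a lock-free queue, loosely in the
style of ABP / Chase & Lev (not the exact published algorithms; pop
and overflow handling are omitted):

    #include <atomic>
    #include <cstddef>

    struct Task;  // opaque task type

    class Deque {
      static const std::size_t N = 4096;  // fixed buffer for the sketch
      Task* buf[N];
      std::atomic<long> top{0};     // FIFO end, touched by thieves
      std::atomic<long> bottom{0};  // LIFO end, touched by the owner
    public:
      void push(Task* t) {          // owner only
        long b = bottom.load(std::memory_order_relaxed);
        buf[b % N] = t;
        bottom.store(b + 1, std::memory_order_release);
      }
      Task* steal() {               // any thief
        long t = top.load(std::memory_order_acquire);
        long b = bottom.load(std::memory_order_acquire);
        if (t >= b) return nullptr;           // looks empty
        Task* task = buf[t % N];
        // The CAS is the consensus: of all thieves that read the same
        // 'top', exactly one advances it and wins the task.
        if (!top.compare_exchange_strong(t, t + 1)) return nullptr;
        return task;
      }
    };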

Role of the 'CAS'

• Consensus between N thieves and the victim
  - in theory, not possible (wait-free) with less powerful synchronization primitives
  - e.g.: atomic register reads/writes, test&set
• How to avoid the 'CAS'?

Relaxed semantics

• "Idempotent work stealing" [PPoPP09]
  - Maged M. Michael, Martin T. Vechev, Vijay A. Saraswat (who also work on the X10 language)
  - avoids the CAS in the pop operation
➡ more performance

• Drawback
  - a task is returned (and executed) at least once...
  - ...instead of exactly once

Cooperative approach

• [X. Besseron, C. Laferrière]
• Keeps the usual semantics
  - a task is extracted exactly once
• So... avoid concurrency between victim & thieves
  - the victim interrupts its work to process the steal requests
  - some similarity with TasCell [09] and Capsule [06]
• Drawback
  - the victim must poll for requests
  - thieves are kept waiting

Cooperative WS

[Animation: the victim's sequential loop tests a steal-request counter
("test ?"). Thieves post requests by incrementing the counter (0 → 1 →
2); the victim then splits its remaining work among all pending
thieves, and the counter returns to 0.]

• Gain
  - work overhead: 1 memory read
  - replies to K thieves at once instead of only 1
  - better workload balance
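A minimal sketch of the polling side of this protocol (illustrative
names, not the KAAPI implementation; the reply channel back to the
thieves is elided):

    #include <atomic>

    struct StealPoint {
      std::atomic<int> requests{0};   // number of thieves waiting
    };

    // Thief side: post a request, then wait for the victim's reply.
    inline void post_request(StealPoint& sp) {
      sp.requests.fetch_add(1, std::memory_order_release);
    }

    // Victim side, called from inside its sequential loop: one memory
    // read when nobody is stealing; otherwise split the remaining work
    // among all K pending thieves at once.
    template <class SplitFn>
    inline void poll(StealPoint& sp, SplitFn split_among) {
      int k = sp.requests.load(std::memory_order_acquire);  // the 1 read
      if (k > 0) {
        split_among(k);                 // give each thief a share
        sp.requests.fetch_sub(k, std::memory_order_release);
      }
    }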

Outline

• Athapascan / Kaapi: a software stack
• Foundations of work stealing
• Controlling the overheads
• Experiments with STL algorithms

KaSTL: Parallel STL

• [PhD of Daouda Traoré, 2008]
• Techniques applied to 95% of the STL algorithms
  - STL: the C++ Standard Template Library
• Comparison with other libraries
  - Cilk++, Intel Threading Building Blocks (TBB)
  - MCSTL (GNU STL parallelized with OpenMP)
• Multiprocessor: 8 AMD CPUs with 2 cores each (16 cores)

Experiments

• Two sets of experiments
  - 1/ with adaptive algorithms [PhD of Daouda Traoré]
  - 2/ with cooperative work stealing [Daouda Traoré, Xavier Besseron, Christophe Laferrière]
• Methodology
  - average over 30 runs
  - 1 run = average of 100 basic experiments; the first experiment is not taken into account

Prefix: homogeneous processors

• coarse grain, 30000 elements

Prefix: heterogeneous processors

• (p-1) processors have the same speed S; 1 processor has speed S/2
• compared with the optimal prefix using a static partition of the array

Sort: ~100M elements, ~1 s sequential time

[Figure: speedup (up to 10) of sort, medium size (~1 s), versus the
number of processors (1 to 16); curves for KASTL, KAAPI, Cilk++, TBB,
and CAPSULE.]

Partial conclusion

• Adapting tasks to the activity (or idleness) of the processors pays off

• But only for coarse computations
  - sequential time > 0.1 s

➡ add cooperative work stealing
  - named X-KaSTL or CKaapi in the diagrams

std::transform: 1 processor

std::transform: 8 processors (NUMA machine)

[Figure: ratio Tstl / Tlibrary (0 to 7) versus size (0 to 300000);
curves for KaSTL, X-KaSTL, TBB, Cilk, and STL; annotated with the size
13000 and the time 3.78e-05 s.]

std::merge: 1 processor

std::merge: 8 processors

Partial conclusion

• Effective speedup on finer computations
  - sequential time ~ 10⁻³ s = 1 ms

➡ impact of the better work-load balance
  - comparison: split in K versus split in 2

"Transform", 2 threads

[Figure: time (s, 0 to 0.0035) for the transform algorithm with 2
threads versus problem size (0 to 300000); curves for CKASTL, CKASTL2,
TBB, and STL (1 thread); annotated "Split in K" and "Split in 2".]

"Transform", 4 threads

[Figure: time (s, 0 to 0.0035) for the transform algorithm with 4
threads versus problem size (0 to 300000); curves for CKASTL, CKASTL2,
TBB, and STL (1 thread); annotated "Split in K" and "Split in 2".]

"Transform", 8 threads

[Figure: time (s, 0 to 0.0035) for the transform algorithm with 8
threads versus problem size (0 to 300000); curves for CKASTL, CKASTL2,
TBB, and STL (1 thread); annotated "Split in K" and "Split in 2".]

"Merge", 8 threads

[Figure: time (s, 0 to 0.007) for the merge algorithm with 8 threads
versus problem size (0 to 300000); curves for CKASTL, CKASTL2, TBB,
and STL (1 thread); annotated "Split in K" and "Split in 2".]

[Same figure, annotated: similar performance across libraries, because
the merge algorithm always splits the work in 2.]

Conclusion

• Athapascan/Kaapi
  - a data flow model with lazy task generation
    - reduces the work overhead
  - sequential semantics
  - an original approach: effective parallelization of fine-grain computations with cooperative work stealing
• http://kaapi.gforge.inria.fr/
• Drawbacks
  - non-standard research software
  - difficulty of porting already-parallelized applications

Perspectives

• The software is being made more robust
  - an INRIA action supports its development (2 years)
  - ported on top of most Unixes (Linux, MacOSX, [SunOS], iPhoneOS)
  - [2010] will be ported to IBM BlueGene
  - [2010] will be ported to MPSoC with STMicroelectronics

• Viability of the cooperative approach
  - no theoretical foundation yet
  - coupling the technique with concurrent work stealing

Perspectives (cont.)

• Mixing CPUs & GPUs
  - preliminary work
  - deeper integration of the GPU as a processing resource
  - the next Fermi GPU + driver?
• Better coupling between the OS & the middleware
  - important to know (in advance) the available number of resources
• Taking NUMA architectures into account
  - ongoing work at MOAIS [J-N. Quintin, PhD]

Other applications

• Parallelization of Bayesian computations
• Parallel computer algebra
  - LinBox: http://www.linalg.org
• Combinatorial optimization [PRiSM (Paris), B. Lecun]
• Academic applications
  - [III, IV, V Grid@Work contests]
    - NQueens
    - an option pricing application based on Monte Carlo simulation
  - numerical kernels for CEM, CFD grid applications
    - finite difference / finite element
    - reaction / diffusion with chemical species
• SOFA (http://www.sofa-framework.org)

NQueens [2006, 2007]

• Grid5000 (French academic national grid)
  - 2006: N=23 in 74 min on 1422 cores
  - 2007: N=23 in 35 min 7 s on 3654 cores
• TakTuk: fast deployment tool

Monte Carlo / Option Pricing

• Grid5000 (France) + InTrigger (Japan)
• 3609 cores: ~2700 from Grid5000, ~900 from InTrigger
• SSH connection between Japan and France

Physics Simulation

• SOFA: a real-time physics engine
• A strongly supported INRIA initiative
• Open source: http://www.sofa-framework.org
• Target application: surgery simulation

Iterative Application

• Scheduling by graph partitioning
• Metis / Scotch

Domain decomposition

[Diagram: a NUMA (Non-Uniform Memory Access) machine: dual-processor,
dual-core nodes with cores, caches, and memories, connected by a
network.]

Domain decomposition (cont.)

• Graph partitioners
  - Scotch
  - Metis
  - hierarchical: ANR DISCOGRID

Experiments

• Finite difference kernel
• Kaapi / C++ code versus Fortran MPI code
• Constant-size subdomain D per processor
• Cluster: N processors on a single cluster
• Grid: N/4 processors per cluster, 4 clusters

D = 256^3

            # processors   Cluster (s)   Grid (s)   Overhead
    KAAPI        1             0.49        0.49        -
                64             0.55        0.84       0.53
               128             0.65        0.91       0.4
    MPI          1             0.44        0.44        -
                64             0.66        2.02       2.06
               128             0.68        1.57       1.31

Optimizing MPI code

• Overlapping communication with computation
• At the cost of significant code restructuring

[Figure: mean time per iteration (s, up to 1.8) for 256^3 points per
processor between Rennes and Bordeaux, on 16+16, 32+32, and 64+64
processors; curves for kaapi-opt, sendrecv-ompi, irecvisend-ompi, and
async-ompi. Annotation: KAAPI automatically reschedules computation
and communication tasks.]

Communication

• Active-message-like communication protocol
• Multi-network (TCP, Myrinet, SSH tunnel with TakTuk)
• High capacity to overlap communication with computation
• An original message aggregation protocol
