
Lecture 15-16: Parallel Programming with Cilk/Cilkplus

Concurrent and Multicore Programming

Department of Computer Science and Engineering
Yonghong Yan
[email protected]
www.secs.oakland.edu/~yan

Topics (Part 1)

• Introduction
• Principles of parallel algorithm design (Chapter 3)
• Programming on shared memory systems (Chapter 7)
 – OpenMP
 – Cilk/Cilkplus
 – PThreads, mutual exclusion, locks, synchronizations
• Analysis of parallel program executions (Chapter 5)
 – Performance Metrics for Parallel Systems
  • Execution Time, Overhead, Speedup, Efficiency, Cost
 – Scalability of Parallel Systems
 – Use of performance tools

Shared Memory Systems: Multicore and Multisocket Systems

[Figure: block diagram of a multicore/multisocket shared memory system]

Threading on Shared Memory Systems

• Employ parallelism to compute on shared data
 – performance on a fixed memory footprint (strong scaling)
• Useful for hiding latency
 – e.g., latency due to I/O, memory, and communication
• Useful for scheduling and load balancing
 – especially for dynamic concurrency
• Relatively easy to program
 – easier than message-passing? you be the judge!

Programming Models on Shared Memory Systems

• Library-based models
 – All data are shared
 – Pthreads, Threading Building Blocks, Java Concurrency, Boost, Microsoft .NET Task Parallel Library
• Directive-based models, e.g., OpenMP
 – shared and private data
 – pragma syntax simplifies creation and synchronization
• Programming languages
 – CilkPlus (Intel, GCC), and MIT Cilk
 – CUDA (NVIDIA)
 – OpenCL

Toward Standard Threading for C/C++

At last month's meeting of the C standard committee, WG14 decided to form a study group to produce a proposal for language extensions for C to simplify parallel programming. This proposal is expected to combine the best ideas from Cilk and OpenMP, two of the most widely-used and well-established parallel language extensions for the C language family.

As the chair of this new study group, named CPLEX (C Parallel Language Extensions), I am announcing its organizational meeting:

June 17, 2013 10:00 AM PDT, 2 hours

Interested parties should join the group's mailing list, to which further information will be sent: http://www.open-std.org/mailman/listinfo/cplex

Questions can be sent to that list, and/or to me directly.

--
Clark Nelson
Vice chair, PL22.16 (ANSI C++ standard committee)
Chair, SG10 (WG21 study group for C++ feature-testing)
Chair, CPLEX (WG14 study group for C parallel language extensions)
Intel Corporation
[email protected]

OpenMP

• Identify static mapping and scheduling of tasks and cores
 – Before tasking
• No need to create threads manually
• Sequential code migrates to parallel code by inserting directives
• Optimizations for memory and synchronization are the key
 – Reduce memory contention and parallelism overhead

• Users achieve both problem decomposition into tasks and the mapping of tasks to hardware
 – OpenMP worksharing: a 1:1 mapping (see the sketch below)
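A minimal sketch of the directive-based migration just described (the function and loop are illustrative, not from the slides): a single pragma turns a sequential loop into a worksharing region, with no manual thread creation and a static mapping of iteration chunks to threads.

    void vscale(double *a, int n, double s) {
        #pragma omp parallel for schedule(static)  /* static worksharing mapping */
        for (int i = 0; i < n; i++)
            a[i] *= s;                             /* iterations split among threads */
    }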

Outline for Cilk/Cilkplus

• Introduction and Basic Cilk Programming
• Cilk Work-stealing Scheduler
• Implementation Strategies
• Performance Analysis
• Scheduling Performance Analysis
• More Examples

Cilk/Cilkplus Summary

• A simpler model for writing parallel programs
 – Focusing on problem decomposition
  • What computation can be performed in parallel
 – The runtime performs the mapping
• Extends C/C++ with two main keywords → tasking
 – spawn: invoke a function (potentially) in parallel
 – sync: wait for a procedure's spawned functions to finish
• Faithful language extension
 – if the Cilk/Cilkplus keywords are elided → C/C++ program semantics
• The idea has been adopted by OpenMP with task (a sketch follows below)
 – omp task
 – omp taskwait
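To illustrate that adoption, here is a minimal OpenMP-task version of fib (a sketch, not from the original slides; assumes an OpenMP 3.0+ compiler, e.g. gcc -fopenmp):

    #include <stdio.h>

    int fib(int n) {
        if (n < 2) return n;
        int x, y;
        #pragma omp task shared(x)   /* like spawn */
        x = fib(n - 1);
        #pragma omp task shared(y)   /* like spawn */
        y = fib(n - 2);
        #pragma omp taskwait         /* like sync */
        return x + y;
    }

    int main(void) {
        int result;
        #pragma omp parallel         /* create the team of workers */
        #pragma omp single           /* one worker seeds the task tree */
        result = fib(10);
        printf("fib(10) = %d\n", result);
        return 0;
    }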

Availability

• Cilk and Cilkplus
 – Cilk was originally developed at MIT by Charles E. Leiserson's group
  • http://supertech.csail.mit.edu/cilk/
 – Cilkplus is the commercialized version from Intel: cilk_spawn and cilk_sync
  • Adds cilk_for for parallel execution of a for loop
• Availability
 – MIT Cilk
 – Intel compilers, GCC 4.9

• lennon.secs.oakland.edu

Cilk Example

[Figure: table of the Fibonacci numbers for n = 0 through 16]

https://en.wikipedia.org/wiki/Fibonacci_number

Fibonacci (MIT Cilk)

C elision:

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }

Cilk code:

    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

A Cilk program's serial elision is always a legal implementation of the Cilk semantics. Cilk provides no new data types.
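One way to see the serial elision concretely (a preprocessor sketch, illustrative rather than the official mechanism):

    #define cilk        /* a Cilk procedure becomes a plain C function */
    #define spawn       /* a spawned call becomes an ordinary call */
    #define sync        /* 'sync;' becomes the empty statement ';' */

With these definitions, the Cilk fib above compiles as the C fib, which is why the elision is always a legal implementation of the Cilk semantics.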

Basic Cilk Keywords

    cilk int fib (int n) {        /* cilk identifies a function as a Cilk
                                     procedure, capable of being spawned
                                     in parallel */
        if (n<2) return (n);
        else {
            int x,y;
            x = spawn fib(n-1);   /* the named child Cilk procedure can
                                     execute in parallel with the parent
                                     caller */
            y = spawn fib(n-2);
            sync;                 /* control cannot pass this point until
                                     all spawned children have returned */
            return (x+y);
        }
    }

Dynamic Multithreading

    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

Example: fib(4)

[Figure: the spawn tree of fib(4) – fib(4) spawns fib(3) and fib(2); fib(3) spawns fib(2) and fib(1); each fib(2) spawns fib(1) and fib(0)]

The computation dag unfolds dynamically.

Mapping Tasks to Hardware

• Cilk allows the programmer to express potential parallelism in an application
 – Many tasks

• The Cilk scheduler maps Cilk tasks onto processors dynamically at runtime
 – A thread in this context is a processing element (PE)

[Figure: P processors, each with its cache ($), connected by a network to shared memory and I/O]

Outline for Cilk/Cilkplus

• Introduction and Basic Cilk Programming
• Cilk Work-stealing Scheduler
• Implementation Strategies
• Performance Analysis
• Scheduling Performance Analysis
• More Examples

Scheduling Tasks in Cilk

• Lazy parallelism
 – Put off work for parallel execution until necessary
  • E.g., there is no need for parallel execution when there are not enough PEs
• Work-stealing
 – Multiple PEs share work (tasks)
  • A PE looks for work in other PEs when it becomes idle
 – Any PE can create work (tasks) via spawn

Cilk's Work-Stealing Scheduler

• Each PE maintains a work deque of ready tasks, and it manipulates the bottom of the deque like a stack – push and pop. (https://en.wikipedia.org/wiki/Double-ended_queue)
• When a processor runs out of work, it steals a task from the top of a random victim's deque.

[Figure sequence: four PEs, each with a deque – Spawn! pushes a task onto the bottom of the spawning PE's deque; Return! pops a task from the bottom; Steal! removes a task from the top of a random victim's deque]
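To make the deque operations concrete, here is a minimal, mutex-protected sketch in C (illustrative only: real Cilk uses a lock-free scheme, the THE protocol in Cilk-5, and all names here are hypothetical):

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct { void (*run)(void *); void *arg; } task_t;

    typedef struct {
        task_t buf[1024];        /* fixed capacity; no overflow checks, for brevity */
        int top, bottom;         /* steal at top; push/pop at bottom */
        pthread_mutex_t lock;
    } deque_t;

    /* Owner: a spawn pushes a task onto the bottom. */
    void push_bottom(deque_t *d, task_t t) {
        pthread_mutex_lock(&d->lock);
        d->buf[d->bottom++ % 1024] = t;
        pthread_mutex_unlock(&d->lock);
    }

    /* Owner: pop the most recently pushed task (LIFO, like a stack). */
    bool pop_bottom(deque_t *d, task_t *t) {
        pthread_mutex_lock(&d->lock);
        bool ok = d->bottom > d->top;
        if (ok) *t = d->buf[--d->bottom % 1024];
        pthread_mutex_unlock(&d->lock);
        return ok;
    }

    /* Thief: steal the oldest task from the top of a victim's deque. */
    bool steal_top(deque_t *d, task_t *t) {
        pthread_mutex_lock(&d->lock);
        bool ok = d->bottom > d->top;
        if (ok) *t = d->buf[d->top++ % 1024];
        pthread_mutex_unlock(&d->lock);
        return ok;
    }

A deque_t must be zero-initialized (e.g. with PTHREAD_MUTEX_INITIALIZER for the lock) before use. Stealing from the top hands the thief the oldest, and typically largest, piece of work, which keeps steals rare.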

Dynamic Multithreading

In Cilk Plus, the keywords are _Cilk_spawn and _Cilk_sync:

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = _Cilk_spawn fib(n-1);
            y = _Cilk_spawn fib(n-2);
            _Cilk_sync;
            return (x+y);
        }
    }

Dynamic Multithreading

With the cilk/cilk.h macro spellings:

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = cilk_spawn fib(n-1);
            /* everything after a spawn, up to the sync, is the continuation */
            y = cilk_spawn fib(n-2);
            cilk_sync;
            return (x+y);
        }
    }

Work-Stealing State on both the Program Stack and the Deque

• At a fork point, add tasks to the tail of the current worker's deque and execute the other task.
• If idle, steal work from the head of a random other worker's deque.
• When at an incomplete join point, pop work off the tail of the worker's own deque (reverse fork).
• If the worker's own deque is empty, either stall at the join, or do a random steal.

[Figure: program stacks and task deques of workers P0 and P1 during fib(4); P1 steals fib(3) from the head of P0's deque and later syncs at the join]

Pablo Halpern, 2015 (CC BY 4.0)

Child Stealing

• At a spawn, the child task is pushed onto the worker's deque.
 – A task data structure is allocated on the heap
 – Everything needed to run the child is stored in the task data structure
 – A pointer to the task data structure is pushed onto the deque
• The worker then executes the fork continuation immediately.
• An idle worker can steal the child task.
• If the child task is not stolen, it is run by the original worker when it reaches the join point.
• Typically, the scheduler stalls at the join point if there are stolen children that have not completed.

Pablo Halpern, 2015 (CC BY 4.0)

Continuation Stealing

• At a spawn, the continuation is pushed onto the worker's deque.
 – Registers are saved on the stack.
 – A pointer to the current stack frame is pushed onto the deque.
• The worker then executes the child immediately, as if it were a normal call.
• An idle worker can steal the continuation task.
• Upon completing the child, if the continuation (parent) has not been stolen, the original worker continues as if returning from a normal function call.
• The join continuation is run by whichever worker completes its task last.
 – Typically, no worker stalls at the join point.
 – The worker running after the join might be different from the one entering it.

Pablo Halpern, 2015 (CC BY 4.0)

Advantages of Child Stealing over Continuation Stealing

Both are types of work stealing. Child stealing has a number of practical advantages, however:
• Child-stealing libraries can be implemented without special compiler support; continuation stealing typically requires compiler support.
• At each fork and join point, a continuation-stealing implementation might switch to a different worker thread, confusing code that depends on thread-local storage.

Advantages of Continuation Stealing over Child Stealing

Conversely, continuation stealing has a number of theoretical advantages over child stealing:
• Queue size is bounded by the recursion depth, and stack space is bounded by P times the serial stack usage, vs. an unbounded queue size for child stealing.
• On a single worker, continuation stealing produces execution identical to the serial code; child stealing produces a scrambled execution order.
• It naturally lends itself to non-stalling join points, making it closer to an ideal greedy scheduler.
• Certain features are easier to implement efficiently on top of a continuation-stealing scheduler, for example associative reductions.

("Only Monsters Steal Children")

Pablo Halpern, 2015 (CC BY 4.0)
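A minimal illustration of the single-worker ordering point above (f and g are hypothetical functions):

    x = cilk_spawn f();   /* the spawn */
    g();                  /* the continuation */

    /* Continuation stealing: the lone worker runs f() to completion and
       then g() -- exactly the serial order.
       Child stealing: f() is queued, g() runs first, and f() is popped
       afterwards -- scrambled relative to serial execution. */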

Outline for Cilk/Cilkplus

• Introduction and Basic Cilk Programming
• Cilk Work-stealing Scheduler
• Implementation Strategies
• Performance Analysis
• Scheduling Performance Analysis
• More Examples

Compiling spawn — Fast Clone

Cilk source:

    x = spawn fib(n-1);

cilk2c output (C post-source):

    frame->entry = 1;          /* save the entry number in the Cilk frame */
    frame->n = n;              /* save live variables */
    push(frame);               /* push the frame onto the deque */
    x = fib(n-1);              /* run the child */
    if (pop()==FAILURE) {      /* frame was stolen: parent resumes remotely */
        frame->x = x;
        frame->join--;
        /* clean up & return to scheduler */
    }

The Cilk frame holds the entry number, the join counter, and the local variables n, x, y. If the frame is stolen, the parent is suspended and later resumed remotely by the thief.

Compiling sync — Fast Clone

Cilk source:

    sync;

cilk2c output (C post-source):

    ;

No synchronization overhead in the fast clone! Only the slow clone, the stolen frame at the top of a deque, executes real synchronization code; the fast clones below it need none.

Compiling the Slow Clone

    void fib_slow(fib_frame *frame) {
        int n,x,y;
        switch (frame->entry) {    /* restore the program counter */
            case 1: goto L1;
            case 2: goto L2;
            case 3: goto L3;
        }
        ...
        frame->entry = 1;          /* same as the fast clone */
        frame->n = n;
        push(frame);
        x = fib(n-1);
        if (pop()==FAILURE) {
            frame->x = x;
            frame->join--;
            /* clean up & return to scheduler */
        }
        if (0) {
            L1:;                   /* restore local variables if resuming */
            n = frame->n;
        }
        ...                        /* continue */
    }

Project Accounts

• On orion.ec.oakland.edu
 – Need VPN to access from home

• Account name is the same as your NetID
 – Password: <first four letters of your NetID>1234
 – Change it the first time you login

• Follow the development setup steps to clone the OpenMP runtime repo and the examples repo

Outline for Cilk/Cilkplus

• Introduction and Basic Cilk Programming
• Cilk Work-stealing Scheduler
• Implementation Strategies
• Performance Analysis
• Scheduling Performance Analysis
• More Examples

Multithreaded Computation

[Figure: a computation dag with an initial task, a final task, and spawn, continue, and return edges]

• The dag G = (V, E) represents a parallel instruction stream.
• Each vertex v of V represents a (Cilk) task: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge e of E is either a spawn edge, a return edge, or a continue edge.

Algorithmic Complexity Analysis

TP = execution time on P processors

• Computation graph abstraction:
 – node = arbitrary sequential computation
 – edge = dependence (the successor node can only execute after the predecessor node has completed)
 – Directed Acyclic Graph (DAG)
• Processor abstraction:
 – P identical processors
 – each processor executes one node at a time

T1 = work

T∞ = span* (* also called critical-path length or computational depth)

LOWER BOUNDS:
• TP >= T1/P
• TP >= T∞

Speedup

Definition: T1/TP = speedup on P processors.

If T1/TP = Θ(P), we have linear speedup;
if T1/TP = P, we have perfect linear speedup;
if T1/TP > P, we have superlinear speedup, which is not possible in our model because of the lower bound TP >= T1/P.

Parallelism and Parallel Slackness

• We have the lower bounds TP >= T∞ and TP >= T1/P.
• The maximum possible speedup given T1 and T∞, i.e. the parallelism:
    parallelism = T1/T∞
 – Independent of P; it depends only on the graph.
• Parallel slackness is the ratio (T1/T∞)/P, i.e. parallelism divided by processor count.
 – The larger the slackness, the less the impact of T∞ on performance.

Example: fib(4)

[Figure: the fib(4) dag, with the 8 tasks on the critical path numbered 1 through 8]

Assume for simplicity that each Cilk task in fib() takes unit time to execute.

Work: T1 = 17
Span: T∞ = 8
Parallelism: T1/T∞ = 17/8 ≈ 2.125

Using many more than 2 processors makes little sense.

Parallelizing Vector Addition

Serial C code:

    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }

Parallelization strategy:
1. Convert loops to recursion.
2. Insert Cilk keywords.

Step 1, convert loops to recursion:

    void vadd (real *A, real *B, int n) {
        if (n<=BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            vadd (A, B, n/2);
            vadd (A+n/2, B+n/2, n-n/2);
        }
    }

Step 2, insert Cilk keywords:

    cilk void vadd (real *A, real *B, int n) {
        if (n<=BASE) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            spawn vadd (A, B, n/2);
            spawn vadd (A+n/2, B+n/2, n-n/2);
            sync;
        }
    }

Side benefit: divide and conquer is generally good for caches!
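For reference, a compilable Cilk Plus rendering of the final version (a sketch: the BASE value and the real typedef are assumptions; per the Availability slide it needs a Cilk Plus compiler, e.g. GCC 4.9 with -fcilkplus or the Intel compilers):

    #include <cilk/cilk.h>

    #define BASE 1024
    typedef double real;

    void vadd(real *A, real *B, int n) {
        if (n <= BASE) {
            for (int i = 0; i < n; i++) A[i] += B[i];
        } else {
            cilk_spawn vadd(A, B, n/2);        /* first half in a child task */
            vadd(A + n/2, B + n/2, n - n/2);   /* second half in the continuation */
            cilk_sync;                         /* join before returning */
        }
    }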

Vector Addition Analysis

To add two vectors of length n, where BASE = Θ(1):

Work: T1 = Θ(n)

Span: T∞ = Θ(log n)

Parallelism: T1/T∞ = Θ(n/log n)

[Figure: the recursion tree, with BASE-sized chunks at the leaves]

Outline for Cilk/Cilkplus

• Introduction and Basic Cilk Programming
• Cilk Work-stealing Scheduler
• Implementation Strategies
• Performance Analysis
• Scheduling Performance Analysis
• More Examples

Analysis: Greedy Scheduling

IDEA: Do as much as possible on every step.

Definition: A task is ready if all its predecessors have executed.

• Complete step: at least P tasks are ready. Run any P of them.
• Incomplete step: fewer than P tasks are ready. Run all of them.

[Figure: a dag scheduled greedily with P = 3]

Greedy-Scheduling Theorem

Theorem [Graham '68 & Brent '75]. Any greedy scheduler achieves

    TP <= T1/P + T∞

Proof.
• # complete steps <= T1/P, since each complete step performs P work.

• # incomplete steps <= T∞, since each incomplete step reduces the span of the unexecuted dag by 1.
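As a quick check with the fib(4) numbers from the earlier slide (T1 = 17, T∞ = 8): on P = 2 processors the theorem guarantees T2 <= 17/2 + 8 = 16.5, while the lower bounds give T2 >= max(17/2, 8) = 8.5.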

Performance of Work-Stealing

Theorem: On P processors, Cilk's work-stealing scheduler achieves an expected running time of

    TP = T1/P + O(T∞)

(the work term plus the critical-path term)

Critical Path Overhead

• Critical-path overhead = the smallest constant c∞ such that TP <= T1/P + c∞ T∞.

Work Overhead

• Work overhead: c1 = T1/TS, where TS is the running time of the serial elision; it measures the cost that spawn bookkeeping adds to the work.

Breakdown of Work Overhead

Benchmark: fib on one processor; the bars show T1/TS broken into the C function call, state saving, frame allocation, and the stealing protocol:

    MIPS R10000    115 ns
    UltraSPARC I   113 ns
    Pentium Pro     78 ns
    Alpha 21164     27 ns

The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.

Outline for Cilk/Cilkplus

• Introduction and Basic Cilk Programming
• Cilk Work-stealing Scheduler
• Implementation Strategies
• Performance Analysis
• Scheduling Performance Analysis
• More Examples

Square-Matrix Multiplication

C = A × B, where A, B, C are n × n matrices and

    cij = Σ (k = 1 to n) aik bkj

Assume for simplicity that n = 2^k.

Recursive Matrix Multiplication

Divide and conquer — partition each matrix into four (n/2) × (n/2) quadrants:

    [ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
    [ C21 C22 ] = [ A21 A22 ] × [ B21 B22 ]

              [ A11B11 A11B12 ]   [ A12B21 A12B22 ]
            = [ A21B11 A21B12 ] + [ A22B21 A22B22 ]

8 multiplications of (n/2) × (n/2) matrices.
1 addition of n × n matrices.

Matrix Multiplication

    cilk void MultA(*C, *A, *B, n) { // C = C + A * B
        /* base case & partition matrices */
        spawn MultA(C11,A11,B11,n/2);
        spawn MultA(C12,A11,B12,n/2);
        spawn MultA(C22,A21,B12,n/2);
        spawn MultA(C21,A21,B11,n/2);
        sync;
        spawn MultA(C21,A22,B21,n/2);
        spawn MultA(C22,A22,B22,n/2);
        spawn MultA(C12,A12,B22,n/2);
        spawn MultA(C11,A12,B21,n/2);
        sync;
        return;
    }

Work of Multiply

Work: T1(n) = 8 T1(n/2) + Θ(1) = Θ(n³)

Span of Multiply

The four spawns before each sync run in parallel (take the maximum over them), and the two sync blocks run in series:

Span: T∞(n) = 2 T∞(n/2) + Θ(1) = Θ(n)

Parallelism: T1/T∞ = Θ(n³) / Θ(n) = Θ(n²)

Merging Two Sorted Arrays

    void Merge(int *C, int *A, int *B, int na, int nb) {
        while (na>0 && nb>0) {
            if (*A <= *B) {
                *C++ = *A++; na--;
            } else {
                *C++ = *B++; nb--;
            }
        }
        while (na>0) { *C++ = *A++; na--; }
        while (nb>0) { *C++ = *B++; nb--; }
    }

Time to merge n elements = Θ(n).
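A hypothetical driver for Merge, using the two sorted arrays from the slide's figure:

    #include <stdio.h>

    int main(void) {
        int A[] = {3, 12, 19, 46};
        int B[] = {4, 14, 21, 23};
        int C[8];
        Merge(C, A, B, 4, 4);          /* Merge as defined above */
        for (int i = 0; i < 8; i++)
            printf("%d ", C[i]);       /* prints: 3 4 12 14 19 21 23 46 */
        printf("\n");
        return 0;
    }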

Merge Sort

    cilk void MergeSort(int *B, int *A, int n) {
        if (n==1) {
            B[0] = A[0];
        } else {
            int *C;
            C = (int*) Cilk_alloca(n*sizeof(int));
            spawn MergeSort(C, A, n/2);
            spawn MergeSort(C+n/2, A+n/2, n-n/2);
            sync;
            Merge(B, C, C+n/2, n/2, n-n/2);
        }
    }

[Figure: merge tree – 19 3 12 46 33 4 21 14 is merged pairwise into 3 19, 12 46, 4 33, 14 21; then into 3 12 19 46 and 4 14 21 33; finally into 3 4 12 14 19 21 33 46]

Work of Merge Sort

Work: T1(n) = 2 T1(n/2) + Θ(n) = Θ(n lg n)

Span of Merge Sort

Span: T∞(n) = T∞(n/2) + Θ(n) = Θ(n)

Parallelism: T1(n)/T∞(n) = Θ(lg n)

Tableau Construction

Problem: Fill in an n × n tableau A, where A[i, j] = f( A[i, j–1], A[i–1, j], A[i–1, j–1] ).

Dynamic programming:
• Longest common subsequence
• Edit distance
• Time warping

Work: Θ(n²).

[Figure: an 8 × 8 tableau of cells 00 through 77; each cell depends on its left, upper, and upper-left neighbors]

Recursive Construction

Split the n × n tableau into four (n/2) × (n/2) quadrants: I (upper-left), II (upper-right), III (lower-left), IV (lower-right). I must finish first; II and III can then run in parallel; IV runs last. Cilk code:

    spawn I;
    sync;
    spawn II;
    spawn III;
    sync;
    spawn IV;
    sync;

Work: T1(n) = 4 T1(n/2) + Θ(1) = Θ(n²)

Span: T∞(n) = 3 T∞(n/2) + Θ(1) = Θ(n^(lg 3))

Parallelism: T1(n)/T∞(n) ≈ Θ(n^0.42)

A More-Parallel Construction

Split the tableau into nine (n/3) × (n/3) regions I–IX, numbered so that the regions in each anti-diagonal wave can run in parallel. Cilk code:

    spawn I;
    sync;
    spawn II;
    spawn III;
    sync;
    spawn IV;
    spawn V;
    spawn VI;
    sync;
    spawn VII;
    spawn VIII;
    sync;
    spawn IX;
    sync;

Work: T1(n) = 9 T1(n/3) + Θ(1) = Θ(n²)

Span: T∞(n) = 5 T∞(n/3) + Θ(1) = Θ(n^(log₃ 5))

Analysis of Revised Construction

Work: T1(n) = Θ(n²)

Span: T∞(n) = Θ(n^(log₃ 5)) ≈ Θ(n^1.46)

Parallelism: T1(n)/T∞(n) ≈ Θ(n^0.54)

More parallel than the four-quadrant version by a factor of Θ(n^0.54)/Θ(n^0.42) = Θ(n^0.12).
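Where do the exotic exponents come from? A brief derivation via the master theorem (not on the original slides): for $T(n) = a\,T(n/b) + \Theta(1)$ with $a > 1$, case 1 gives $T(n) = \Theta(n^{\log_b a})$. Hence

$$T_\infty(n) = 3\,T_\infty(n/2) + \Theta(1) = \Theta(n^{\lg 3}) \approx \Theta(n^{1.58})$$
$$T_\infty(n) = 5\,T_\infty(n/3) + \Theta(1) = \Theta(n^{\log_3 5}) \approx \Theta(n^{1.46})$$

and the parallelisms are $\Theta(n^{2 - \lg 3}) \approx \Theta(n^{0.42})$ and $\Theta(n^{2 - \log_3 5}) \approx \Theta(n^{0.54})$, matching the slides.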

References

• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. "Introduction to Parallel Computing." Addison Wesley, 2003.
• Charles E. Leiserson. Cilk Lecture 1. Supercomputing Technologies Research Group, MIT Computer Science and Artificial Intelligence Laboratory. http://bit.ly/mit-cilk-lec1
• Charles Leiserson, Bradley Kuszmaul, Michael Bender, and Hua-wen Jing. MIT 6.895 Lecture Notes, Theory of Parallel Systems. http://bit.ly/mit-6895-fall03
• Intel Cilk++ Programmer's Guide. Document #322581-001US.
