
Multithreading for Visual Effects
Theater 411

Agenda
2:00pm - 2:40pm: Multithreading Introduction and Overview (James Reinders)
2:40pm - 3:25pm: Parallel Processing and Multi-Process Evaluation in a Host Application (Andy Lin and Joe Longson)
3:25pm: Break
3:35pm - 4:05pm: Threading SOPs in Houdini (Jeff Lait)
4:05pm - 4:35pm: Building a Scalable Evaluation Engine for Presto (Florian Zitzelsberger)
4:35pm - 5:15pm: Parallel Evaluation of Animated Maya Characters (Martin De Lasa)

www.multithreadingandvfx.org

!"#$%&%'()*+&,-./0(.1&2#*$.3//)4%25 “Think Parallel” “Think Parallel” – Parallelism is almost never effective to “stick in” at the last minute in a program – Think about everything in terms of how to do in parallel as cleanly as possible Motivation !"#$%&'($!&#)*$+

1973: 1 MHz. 2003: 1 GHz. 2004: 3 GHz, still today!
Power wall + ILP wall + memory wall.
Solve: parallel hardware + explicit parallel software + software memory optimization.

Core and Thread Counts, Width
(Images: James Reinders, used with permission. http://lotsofcores.com)

Single core, single thread: ruled for decades.
Multithread: grow die area a small % for an additional hardware thread.
Parallelism: handling more data at once: multibyte, multiword, many words.
Parallel hardware designs can be more efficient if they can assume that programs will use parallelism well.

Task
• Key thing to know:
– Program in TASKS not THREADS. (See http://tinyurl.com/threadsYUCK.)


This means:
– Programmer’s job: identify (lots of) tasks to do
– Runtime’s job: map tasks onto threads at runtime
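A minimal sketch of what “program in tasks” looks like, using TBB’s task_group (the fib example and its serial cutoff are illustrative, not from the course): the program only names the work; the runtime decides which threads run it.

    #include <tbb/task_group.h>
    #include <cstdio>

    long fib(int n) {
        if (n < 16)  // serial cutoff: tasks that are too tiny cost more than they gain
            return n < 2 ? n : fib(n - 1) + fib(n - 2);
        long x = 0, y = 0;
        tbb::task_group g;
        g.run([&] { x = fib(n - 1); });  // a task, not a thread
        g.run([&] { y = fib(n - 2); });
        g.wait();                        // the runtime maps these tasks onto threads
        return x + y;
    }

    int main() { std::printf("fib(30) = %ld\n", fib(30)); }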

James’ BIG 3 REASONS to avoid programming to specific hardware mechanisms:

1. Portability is impaired severely when coding “close to the hardware”

2. Nested parallelism is IMPORTANT (nearly impossible to manage well using “mandatory parallelism” methods such as threads)

3. Other parallelism mechanisms (e.g., vectorization) need to be considered and balanced.

Parallel Programming
• No widely used (popular) programming language was designed for expressing parallel programming.
– Not Fortran, C, C++, Java, C#, Perl, Python, or Ruby

• This creates many challenges
• Fundamental question of all programming languages: level of abstraction

Programming Model Ideal Goals
• Performance: achievable, scalable, predictable, tunable, portable
• Productivity: expressive, composable, debuggable, maintainable
• Portability: functionality & performance across operating systems, compilers, targets

Level of Abstraction: Parallelism
• There is no “perfect” answer (one size fits all)


• Higher level programming (more abstract), e.g. TBB, OpenMP*:
– Desired benefits: more portable, more performance portable, better investment preservation over time
• Lower level programming (less abstract), e.g. OpenCL*, CUDA*:
– Desired benefits: more control for the programmer

* Third party marks may be claimed as the property of others.
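To make the trade-off concrete, here is the same loop at both levels. This side-by-side is illustrative (not from the slides), with an OpenMP pragma standing in for the high level and raw std::thread for the low level; compile the first with OpenMP enabled (e.g., -fopenmp).

    #include <algorithm>
    #include <thread>
    #include <vector>

    // High level: one pragma; the runtime picks thread count, chunking, schedule.
    void scale_high(float *a, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) a[i] *= 2.0f;
    }

    // Low level: we choose the thread count and partitioning ourselves. More
    // control, more code to get right, and it composes poorly if callers also
    // spawn threads of their own.
    void scale_low(float *a, int n, int nthreads) {
        std::vector<std::thread> pool;
        int chunk = (n + nthreads - 1) / nthreads;
        for (int t = 0; t < nthreads; ++t) {
            int lo = t * chunk, hi = std::min(n, lo + chunk);
            pool.emplace_back([=] { for (int i = lo; i < hi; ++i) a[i] *= 2.0f; });
        }
        for (auto &th : pool) th.join();
    }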

Z Y%')(51*")<5*1+5W(*1+<5"*I5W)5A$*&")+5*<5%')5>(3>)(%I53-53%')(<0 Easier to maintain, scales better, easier to debug (get correct) Work Stealing

[Figure: worker threads, each with its own deque and a mailbox]

Each thread obtains work in this order:
1. Take the youngest task from my own deque (locality)
2. Steal a task advertised in my mailbox (cache affinity)
3. Steal the oldest task from a random victim (load balance)

Work Depth First; Steal Breadth First.
Best choice for theft: a big piece of work, with data far from the victim’s hot data. (Tasks nearer the victim’s hot in-cache data are only the second-best choice.)

[Figure: the victim thread’s deque shown against its L1/L2 caches, marking the best and second-best theft choices]
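A toy, single-threaded sketch of that three-step policy (Worker, Task, and next_task are illustrative, not TBB’s real internals; a real scheduler uses concurrent deques and careful synchronization):

    #include <deque>
    #include <optional>
    #include <random>
    #include <vector>

    struct Task { int id; };

    struct Worker {
        std::deque<Task> jobs;     // back = youngest (mine), front = oldest (for thieves)
        std::deque<Task> mailbox;  // tasks advertised to this worker for affinity

        std::optional<Task> pop_youngest() {
            if (jobs.empty()) return std::nullopt;
            Task t = jobs.back(); jobs.pop_back(); return t;
        }
        std::optional<Task> steal_oldest() {
            if (jobs.empty()) return std::nullopt;
            Task t = jobs.front(); jobs.pop_front(); return t;
        }
        std::optional<Task> take_mailbox() {
            if (mailbox.empty()) return std::nullopt;
            Task t = mailbox.front(); mailbox.pop_front(); return t;
        }
    };

    // 1. own deque (locality), 2. mailbox (cache affinity), 3. random victim (load balance)
    std::optional<Task> next_task(std::vector<Worker> &all, size_t me) {
        if (auto t = all[me].pop_youngest()) return t;
        if (auto t = all[me].take_mailbox()) return t;
        static std::mt19937 rng{std::random_device{}()};
        size_t victim = rng() % all.size();
        if (victim != me) return all[victim].steal_oldest();
        return std::nullopt;
    }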

Scale Forward
Intel® Threading Building Blocks 4.0 scales exceptionally well.

TBB Components
• Parallel algorithms and data structures
• Threads and synchronization
• Memory allocation and task scheduling

Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch.
Flow Graph: a set of classes to express parallelism as a graph of compute dependencies and/or data flow.
Concurrent Containers: concurrent access, and a scalable alternative to serial containers with external locking.
Synchronization Primitives: atomic operations, a variety of mutexes with different properties, condition variables.

Task Scheduler: sophisticated work scheduling engine that empowers parallel algorithms and the flow graph classes.
Thread Local Storage: unlimited number of thread-local variables.
Threads: OS API wrappers.
Miscellaneous: thread-safe timers and exception classes.

Memory Allocation: scalable memory manager and false-sharing-free allocators.

Without lambdas, code has to go in a class:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    class ApplyABC {
    public:
        float *a; float *b; float *c;
        ApplyABC(float *a_, float *b_, float *c_) : a(a_), b(b_), c(c_) {}
        void operator()(const tbb::blocked_range<size_t> &r) const {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i] = b[i] + c[i];
        }
    };

    void ParallelFoo(float *a, float *b, float *c, int n) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n, 10000),
                          ApplyABC(a, b, c));
    }

Putting code in a class is annoying; doing it with lambda support is more natural:

    void ParallelApplyFoo(float *a, float *b, float *c, size_t n) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n, 1000),
            [=](const tbb::blocked_range<size_t> &r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    a[i] = b[i] + c[i];
            });
    }

Much easier to teach and read with lambdas.

Intel® TBB Flow Graph: Components
An interesting “new” aspect since the TBB 4.0 release (2011)

• Graph object (== graph handle)
– Contains a pointer to the root task
– Owns tasks created on behalf of the graph
– Users can wait for the completion of all tasks of the graph
• Graph nodes
– Implement sender and/or receiver interfaces
– Nodes manage messages and/or execute function objects
• Edges
– Connect predecessors to successors
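A minimal sketch of those three pieces using TBB’s flow graph interface (assumes TBB 4.0 or later; compile with -ltbb):

    #include <tbb/flow_graph.h>
    #include <cstdio>

    int main() {
        tbb::flow::graph g;  // graph object: owns the tasks run on the graph's behalf

        // Graph nodes: each wraps a function object.
        tbb::flow::function_node<int, int> square(
            g, tbb::flow::unlimited, [](int v) { return v * v; });
        tbb::flow::function_node<int, tbb::flow::continue_msg> print(
            g, tbb::flow::serial, [](int v) {
                std::printf("%d\n", v);
                return tbb::flow::continue_msg();
            });

        tbb::flow::make_edge(square, print);  // edge: connect predecessor to successor

        for (int i = 0; i < 5; ++i) square.try_put(i);  // messages flow through the nodes
        g.wait_for_all();  // wait for the completion of all tasks of the graph
    }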

TBB (threadingbuildingblocks.org)
• Reasons it might matter:
– Everywhere (ports, open source)
– Forever (commercial and open source)
– Performance portable
• Problems:
– C++, not C

Cilk™ Plus (cilkplus.org)
• Reasons it might matter:
– Space/time guarantees
– Performance portable
– C++ and C
– Keywords bring the compiler into the “know”
– “Parent stealing”
– Vectorization help too (array notations, elemental functions, simd)
• Problems:
– Requires compiler changes
– Not feature rich
– Only in Intel and gcc compilers (and some in clang/LLVM)
– Standards adoption still “future”

OpenMP* (openmp.org)
• Reasons it might matter:
– Everywhere (all major compilers)
– Solutions for tasking, vectorization, offload
– Performance portable
• Problems:
– C and Fortran, not so much C++
– Not composable
– Not always in sync with language standards
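A minimal sketch of OpenMP’s tasking solution (Node and tree_sum are illustrative; compile with an OpenMP-enabled compiler, e.g. -fopenmp):

    #include <cstdio>

    struct Node { long value; Node *left, *right; };

    long tree_sum(Node *n) {
        if (!n) return 0;
        long l = 0, r = 0;
        #pragma omp task shared(l)   // each subtree becomes a task...
        l = tree_sum(n->left);
        #pragma omp task shared(r)
        r = tree_sum(n->right);
        #pragma omp taskwait         // ...joined before combining the results
        return l + r + n->value;
    }

    int main() {
        Node leaf{2, nullptr, nullptr}, inner{1, &leaf, nullptr}, root{3, &inner, nullptr};
        long total = 0;
        #pragma omp parallel   // create a thread team...
        #pragma omp single     // ...but let a single thread spawn the root task
        total = tree_sum(&root);
        std::printf("sum = %ld\n", total);
    }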

* Third party marks may be claimed as the property of others.

OpenCL* (khronos.org/opencl)
• Reasons it might matter:
– Explicit heterogeneous controls
– Everywhere (ports)
– Non-proprietary
– Underpinning for tools and libraries
• Problems:
– Low level
– Not performance portable

* Third party marks may be claimed as the property of others.

Choice is Good
• Our favorite programming languages were NOT DESIGNED for parallelism.

• They need HELP.

• Multiple approaches and options are NEEDED.

Vectorization: Who, What, Why, When
A moment of clarity:

task parallelism doesn’t matter without data parallelism

(James’ observation about what Amdahl and Gustafson were ultimately pointing out.)

Parallel first

Vectorize second

Vector data operations: data operations done in parallel.

    void v_add(float *c, float *a, float *b)
    {
        for (int i = 0; i < MAX; i++)
            c[i] = a[i] + b[i];
    }

Compiled as scalar code, each iteration of the loop does:
1. LOAD a[i] -> Ra
2. LOAD b[i] -> Rb
3. ADD Ra, Rb -> Rc
4. STORE Rc -> c[i]
5. ADD i + 1 -> i

Compiled as vector code, each iteration handles four elements at once:
1. LOADv4 a[i:i+3] -> Rva
2. LOADv4 b[i:i+3] -> Rvb
3. ADDv4 Rva, Rvb -> Rvc
4. STOREv4 Rvc -> c[i:i+3]
5. ADD i + 4 -> i

We call this “vectorization”.

PROBLEM: this loop is NOT LEGAL to (automatically) VECTORIZE in C/C++ (without more information).
• Arrays are not really in the language
• Pointers are, evil pointers! (The compiler cannot prove that c does not overlap a or b, so it must assume they may alias.)

Vectorization solutions

1. Auto-vectorization (let the compiler do it when legal)
– Sequential languages and practices get in the way
2. Give the compiler hints
– C99 “restrict” keyword (implied in FORTRAN since 1956)
– Traditional pragmas like “#pragma IVDEP”
– Cilk Plus #pragma simd
– OpenMP 4.0 #pragma omp simd
3. Code differently
– SIMD instruction intrinsics
– Cilk Plus array notations
– Cilk Plus __declspec(vector)
– OpenMP 4.0 #pragma omp declare simd
– OpenCL / CUDA kernel functions
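A sketch of the “hints” option applied to v_add. C99 spells the keyword restrict; most C++ compilers accept __restrict as an extension (used here), and the second variant assumes OpenMP 4.0 support:

    // __restrict promises the compiler that c, a, and b never alias,
    // which makes auto-vectorizing the loop legal.
    void v_add_restrict(float *__restrict c, float *__restrict a,
                        float *__restrict b, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // OpenMP 4.0: the programmer asserts the loop is safe to vectorize.
    void v_add_simd(float *c, float *a, float *b, int n) {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }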


Structure of arrays (SoA) can be easily aligned to cache boundaries and is vectorizable.

Array of structures (AoS) tends to cause cache alignment problems, and is hard to vectorize.

Data Layout: Alignment

[Figure: four data layouts]
• Array of Structures (AoS), padding at end
• Array of Structures (AoS), padding after each structure
• Structure of Arrays (SoA), padding at end
• Structure of Arrays (SoA), padding after each component
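A minimal sketch of the two layouts (field names and sizes are illustrative):

    // AoS: the fields of one element sit together, so a loop over all x values
    // strides through memory and is hard to vectorize.
    struct PointAoS { float x, y, z; };
    PointAoS aos[1024];

    // SoA: each field is a contiguous array that can be aligned to cache
    // boundaries; a loop over x[] uses unit-stride, vectorizable accesses.
    struct PointsSoA {
        alignas(64) float x[1024];  // align to a cache-line boundary
        alignas(64) float y[1024];
        alignas(64) float z[1024];
    };
    PointsSoA soa;

    float sum_x(const PointsSoA &p) {
        float s = 0.0f;
        for (int i = 0; i < 1024; ++i)
            s += p.x[i];  // contiguous loads: easy for the vectorizer
        return s;
    }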

Think Parallel
– Parallelism does NOT work well if “added” at the last minute in a program

– Think about everything in terms of “how to do it in parallel cleanly”

Where to learn more:

http://www.multithreadingandvfx.org/
Downloads include materials from today.
2014: Multithreading for Visual Effects. Expansion of material covered in this tutorial, plus some extras. (English)

http://threadingbuildingblocks.com
2007: Intel Threading Building Blocks. Remains an excellent introduction to TBB; newer features are not covered. (English, Chinese, Japanese, Korean)

http://parallelbook.com
Website has free teaching materials.
2012: Structured Parallel Programming. Excellent text for the essentials of parallel programming for C/C++. (English, Japanese)

http://lotsofcores.com
High Performance Parallelism Pearls: Volume One (2014), Volume Two (2015). (English)

2016: Intel® Xeon Phi™ Processor High Performance Programming, Knights Landing Edition. (English)

Thank you!

!"#$%&%'()*+,-.

///0"#$%&%'()*+&12*1+,-.03(2

!"#$%&%'()*+&,-./0(.1&2#*$.3//)4%25 Agenda 2:00pm - 2:40pm: Multithreading Introduction and Overview James Reinders 2:40pm - 3:25pm: Parallel Processing and Multi-Process Evaluation in a Host Application Andy Lin and Joe Longson 3:25pm: Break 3:35pm - 4:05pm: Threading SOPs in Houdini Jeff Lait 4:05pm - 4:35pm: Building a Scalable Evaluation Engine for Presto Florian Zitzelsberger 4:35pm - 5:15pm: Parallel Evaluation of Animated Maya Characters Martin De Lasa