Session: Threading for Performance with Intel® Threading Building Blocks
INTEL CONFIDENTIAL

Objectives
At the end of this session, you will be able to:
• List the different components of the Intel Threading Building Blocks library
• Code parallel applications using TBB
SOFTWARE AND SERVICES
Copyright © 2014, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners.

Agenda
Intel® Threading Building Blocks background
Generic Parallel Algorithms
• parallel_for
• parallel_reduce
• parallel_sort example
Task Scheduler
Generic Highly Concurrent Containers
Scalable Memory Allocation
Low-level Synchronization Primitives
Intel® Threading Building Blocks (TBB) - Key Features
You specify tasks (what can run concurrently) instead of threads
• Library maps your logical tasks onto physical threads, efficiently using cache and balancing load
• Full support for nested parallelism
Targets threading for scalable performance
• Portable across Linux*, Mac OS*, Windows*, and Solaris*
Emphasizes scalable data-parallel programming
• Loop parallelism tasks are more scalable than a fixed number of separate tasks
Compatible with other threading packages
• Can be used in concert with native threads and OpenMP
Open source and licensed versions available
Limitations
Intel® TBB is not intended for
• I/O bound processing
• Real-time processing
Components of TBB (version 2.1)
Parallel algorithms
• parallel_for (improved)
• parallel_reduce (improved)
• parallel_do (new)
• pipeline (improved)
• parallel_sort
• parallel_scan
Concurrent containers (all improved)
• concurrent_hash_map
• concurrent_queue
• concurrent_vector
Task scheduler
• With new functionality
Synchronization primitives
• atomic operations
• various flavors of mutexes (improved)
Utilities
• tick_count
• tbb_thread (new)
Memory allocators
• tbb_allocator (new), cache_aligned_allocator, scalable_allocator
Task-based Programming with Intel® TBB
Tasks are light-weight entities at user-level
• Intel® TBB parallel algorithms map tasks onto threads automatically
• Task scheduler manages the thread pool
  – Scheduler is unfair, to favor tasks that have been most recently in the cache
• Oversubscription and undersubscription of core resources is prevented by the task-stealing technique of the TBB scheduler
Generic Parallel Algorithms

Generic Programming
Best known example is C++ STL
Enables distribution of broadly-useful high-quality algorithms and data structures
Write best possible algorithm with fewest constraints
• Do not force particular data structure on user
• Classic example: STL std::sort
Instantiate algorithm to specific situation
• C++ template instantiation, partial specialization, and inlining make resulting code efficient
Standard Template Library, overall, is not thread-safe
Generic Programming - Example
The compiler creates the needed versions
T must define a copy constructor and a destructor
template
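The two statements above can be illustrated with a small function template. This is a sketch only: the name larger_of and the helper check_instantiations are hypothetical and do not come from the slides. It shows that one generic definition is instantiated once per argument type, and that T is passed and returned by value, which is why T needs a copy constructor and a destructor.

```cpp
#include <cassert>
#include <string>

// Hypothetical generic function: T only needs to be copyable,
// destructible, and comparable with operator<.
template <typename T>
T larger_of(T x, T y) {
    return (x < y) ? y : x;
}

// Each distinct T used below makes the compiler generate its own version.
inline void check_instantiations() {
    assert(larger_of(20, 5) == 20);       // instantiates larger_of<int>
    assert(larger_of(2.5, 5.2) == 5.2);   // instantiates larger_of<double>
    assert(larger_of(std::string("bar"), std::string("foo")) == "foo");  // larger_of<std::string>
}
```

No particular data structure is forced on the caller: any type that meets the three requirements can be used.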
Intel® Threading Building Blocks Patterns
Task scheduler powers high-level parallel patterns that are pre-packaged, tested, and tuned for scalability
• parallel_for: load-balanced parallel execution of loop iterations where iterations are independent
• parallel_reduce: load-balanced parallel execution of independent loop iterations that perform reduction (e.g. summation of array elements)
• parallel_while: load-balanced parallel execution of independent loop iterations with unknown or dynamically changing bounds (e.g. applying a function to the elements of a linked list)
• parallel_scan: template function that computes parallel prefix
• pipeline: data-flow pipeline pattern
• parallel_sort: parallel sort
Grain Size
OpenMP has a similar parameter
Part of parallel_for, not underlying task scheduler
• Grain size exists to amortize overhead, not balance load
• Units of granularity are loop iterations
Typically only need to get it right within an order of magnitude
Tuning Grain Size
• Too fine: scheduling overhead dominates
• Too coarse: lose potential parallelism
• Tune by examining single-processor performance
• When in doubt, err on the side of making it a little too large, so that performance is not hurt when only one core is available.
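The trade-off can be made concrete by counting how many pieces a loop is cut into for a given grain size. The sketch below is not TBB code: leaf_count is a hypothetical helper, and it assumes a range is halved as long as it holds more iterations than the grain size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of subranges produced when a range of n
// iterations is halved repeatedly until each piece has at most `grain`
// iterations (assumed splitting rule, see lead-in).
inline std::size_t leaf_count(std::size_t n, std::size_t grain) {
    assert(grain > 0);
    if (n <= grain) return 1;              // small enough: one piece of work
    std::size_t left = n / 2;              // otherwise split into two halves
    return leaf_count(left, grain) + leaf_count(n - left, grain);
}
```

Under these assumptions, 100000 iterations with a grain size of 1 give 100000 pieces (overhead dominates), a grain size of 10000 gives 16 pieces, and a grain size of 100000 gives a single piece (no parallelism left).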
The parallel_for Template
template
Requires definition of:
• A range type to iterate over
  – Must define a copy constructor and a destructor
  – Defines is_empty()
  – Defines is_divisible()
  – Defines a splitting constructor, R(R &r, split)
• A body type that operates on the range (or a subrange)
  – Must define a copy constructor and a destructor
  – Defines operator()
Body is Generic
Requirements for parallel_for Body

    Body::Body(const Body&)                          Copy constructor
    Body::~Body()                                    Destructor
    void Body::operator() (Range& subrange) const    Apply the body to subrange

parallel_for partitions the original range into subranges, and deals out subranges to worker threads in a way that:
• Balances load
• Uses cache efficiently
• Scales
Range is Generic
Requirements for parallel_for Range

    R::R (const R&)                   Copy constructor
    R::~R()                           Destructor
    bool R::is_empty() const          True if range is empty
    bool R::is_divisible() const      True if range can be partitioned
    R::R (R& r, split)                Splitting constructor; splits r into two subranges

Library provides predefined ranges
• blocked_range and blocked_range2d
You can define your own ranges
How splitting works on blocked_range2d
Split range...
...recursively...
...until grainsize.
Tasks available to be scheduled to other threads (thieves)
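The recursive splitting described above can be sketched without TBB. Everything in this block is hypothetical: split_tag stands in for tbb::split, IndexRange is a one-dimensional range written for illustration (not blocked_range), and split_and_apply runs on the calling thread, whereas a real scheduler would leave the split-off pieces available for other threads to steal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct split_tag {};  // stand-in for tbb::split (assumption)

// Hypothetical range over [begin, end) that stops dividing at `grain`.
class IndexRange {
    std::size_t begin_, end_, grain_;
public:
    IndexRange(std::size_t b, std::size_t e, std::size_t grain)
        : begin_(b), end_(e), grain_(grain) { assert(grain > 0); }

    // Splitting constructor: r keeps the left half, *this takes the right half.
    IndexRange(IndexRange& r, split_tag)
        : begin_(r.begin_ + (r.end_ - r.begin_) / 2), end_(r.end_), grain_(r.grain_) {
        r.end_ = begin_;
    }

    bool is_empty() const { return begin_ >= end_; }
    bool is_divisible() const { return !is_empty() && end_ - begin_ > grain_; }
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
};

// Split recursively until a piece is no longer divisible, then apply the body.
template <typename Range, typename Body>
void split_and_apply(Range r, const Body& body) {
    if (r.is_empty()) return;
    if (r.is_divisible()) {
        Range right(r, split_tag());   // r shrinks to the left half
        split_and_apply(r, body);
        split_and_apply(right, body);
    } else {
        body(r);
    }
}

// Body that doubles each element of an array over the given subrange.
struct DoubleBody {
    float* array;
    explicit DoubleBody(float* a) : array(a) {}
    void operator()(const IndexRange& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i) array[i] *= 2;
    }
};
```

For example, split_and_apply(IndexRange(0, 10, 3), DoubleBody(data)) visits every index in [0, 10) exactly once, in pieces of at most three iterations.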
An Example using parallel_for
Independent iterations and fixed/known bounds

    const int N = 100000;

    void change_array(float *array, int M) {
        for (int i = 0; i < M; i++) {
            array[i] *= 2;
        }
    }

    int main() {
        float A[N];
        initialize_array(A);
        change_array(A, N);
        return 0;
    }

An Example using parallel_for
Include and initialize the library

    #include "tbb/task_scheduler_init.h"   // include library headers
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"

    using namespace tbb;                   // use namespace

    int main() {
        task_scheduler_init init;          // initialize scheduler
        float A[N];
        initialize_array(A);
        parallel_change_array(A, N);
        return 0;
    }

An Example using parallel_for
Use the parallel_for pattern

    class ChangeArrayBody {                // define task
        float *array;
    public:
        ChangeArrayBody(float *a) : array(a) {}
        // Assumed completion: int index type and loop taken from the serial version.
        void operator()(const blocked_range<int>& r) const {
            for (int i = r.begin(); i != r.end(); i++) {
                array[i] *= 2;
            }
        }
    };

    void parallel_change_array(float *array, int M) {
        // Assumed completion: range [0, M) with the default grain size.
        parallel_for(blocked_range<int>(0, M), ChangeArrayBody(array));
    }
Activity 1: Matrix Multiplication
Convert serial matrix multiplication application into parallel application using parallel_for
• Triple-nested loops

The parallel_reduce Template
template

Requirements for parallel_reduce Body

    Body::Body(const Body&, split)               Splitting constructor
    Body::~Body()                                Destructor
    void Body::operator() (Range& subrange)      Accumulate results from subrange
    void Body::join(Body& rhs);                  Merge result of rhs into the result of this

Reuses Range concept from parallel_for

Serial Example

    // Find index of smallest element in a[0...n-1]
    long SerialMinIndex(const float a[], size_t n) {
        float value_of_min = FLT_MAX;
        long index_of_min = -1;
        // Assumed completion: plain linear scan for the minimum.
        for (size_t i = 0; i < n; ++i) {
            float value = a[i];
            if (value < value_of_min) {
                value_of_min = value;
                index_of_min = i;
            }
        }
        return index_of_min;
    }
Parallel Version (1 of 3)

    class MinIndexBody {
        const float *const my_a;
    public:
        float value_of_min;
        long index_of_min;
        ...                                // on next slides: operator(), split constructor, join
        MinIndexBody(const float a[]) :
            my_a(a),
            value_of_min(FLT_MAX),
            index_of_min(-1)
        {}
    };

    // Find index of smallest element in a[0...n-1]
    long ParallelMinIndex(const float a[], size_t n) {
        MinIndexBody mib(a);
        // Assumed completion: range [0, n) with the default grain size.
        parallel_reduce(blocked_range<size_t>(0, n), mib);
        return mib.index_of_min;
    }
Parallel Version (2 of 3)

    class MinIndexBody {
        const float *const my_a;
    public:
        ...
        // Accumulate result.
        // Assumed completion: same scan as the serial version, over subrange r.
        void operator()(const blocked_range<size_t>& r) {
            const float *a = my_a;
            for (size_t i = r.begin(); i != r.end(); ++i) {
                float value = a[i];
                if (value < value_of_min) {
                    value_of_min = value;
                    index_of_min = i;
                }
            }
        }
        ...
    };

Parallel Version (3 of 3)

    class MinIndexBody {
        const float *const my_a;
    public:
        ...
        // Split
        MinIndexBody(MinIndexBody& x, split) :
            my_a(x.my_a),
            value_of_min(FLT_MAX),
            index_of_min(-1)
        {}
        // Join
        void join(const MinIndexBody& y) {
            if (y.value_of_min < value_of_min) {
                value_of_min = y.value_of_min;
                index_of_min = y.index_of_min;
            }
        }
        ...
    };
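How the three pieces (operator(), splitting constructor, join) cooperate can be sketched in standard C++ without TBB. The names below are hypothetical: split_tag stands in for tbb::split, MinIndexSketch mirrors MinIndexBody but takes a plain index pair instead of a blocked_range, and reduce_sketch is an illustrative sequential driver, not the library's scheduler.

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>

struct split_tag {};  // stand-in for tbb::split (assumption)

struct MinIndexSketch {
    const float* a;
    float value_of_min;
    long index_of_min;

    explicit MinIndexSketch(const float* a_)
        : a(a_), value_of_min(FLT_MAX), index_of_min(-1) {}
    // Splitting constructor: same input, fresh (empty) partial result.
    MinIndexSketch(MinIndexSketch& x, split_tag)
        : a(x.a), value_of_min(FLT_MAX), index_of_min(-1) {}

    // Accumulate the minimum over [begin, end) into this body.
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i != end; ++i) {
            if (a[i] < value_of_min) {
                value_of_min = a[i];
                index_of_min = static_cast<long>(i);
            }
        }
    }
    // Merge the partial result of y into this body.
    void join(const MinIndexSketch& y) {
        if (y.value_of_min < value_of_min) {
            value_of_min = y.value_of_min;
            index_of_min = y.index_of_min;
        }
    }
};

// Halve [begin, end) until a piece has at most `grain` elements; the right
// half gets a freshly split body, which is joined back into the left one.
template <typename Body>
void reduce_sketch(Body& body, std::size_t begin, std::size_t end, std::size_t grain) {
    assert(grain > 0 && begin <= end);
    if (end - begin <= grain) { body(begin, end); return; }
    std::size_t mid = begin + (end - begin) / 2;
    Body right(body, split_tag());
    reduce_sketch(body, begin, mid, grain);
    reduce_sketch(right, mid, end, grain);
    body.join(right);
}

inline long MinIndexViaSketch(const float a[], std::size_t n) {
    MinIndexSketch body(a);
    reduce_sketch(body, 0, n, 4);   // grain size 4 is an arbitrary choice
    return body.index_of_min;
}
```

Because join only replaces the left result when the right one is strictly smaller, ties resolve to the lowest index, as in the serial loop.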
Activity 2: parallel_reduce Lab
Numerical Integration code to compute Pi

Parallel Sort Example (with work stealing)

Quicksort – Step 1
tbb::parallel_sort(color, color+64);
Thread 1 starts with the initial data:
32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

Quicksort – Step 2
Thread 1 partitions/splits its data
[Figure: the 64 values partitioned around 37 into a lower and an upper part]

Quicksort – Step 2
Thread 2 gets work by stealing from Thread 1
[Figure: same partition; Thread 2 takes over one of the two parts]

Quicksort – Step 3
Thread 1 partitions/splits its data
Thread 2 partitions/splits its data
[Figure: the two parts partitioned again, around 7 and 49]
Quicksort – Step 3
Thread 3 gets work by stealing from Thread 1
Thread 4 gets work by stealing from Thread 2
[Figure: four partitions, one per thread]
Quicksort – Step 4
Thread 1 sorts the rest of its data
Thread 3 partitions/splits its data
Thread 2 sorts the rest of its data
Thread 4 sorts the rest of its data
[Figure: values 0 to 7 and 37 to 63 are now in final sorted position; Thread 3 has partitioned its part around 18]
Quicksort – Step 5
Thread 3 sorts the rest of its data
Thread 1 gets more work by stealing from Thread 3
[Figure: values 0 to 18 and 37 to 63 are now in final sorted position]
Quicksort – Step 6
Thread 1 partitions/splits its data
[Figure: the remaining values 19 to 36 partitioned around 27]
Quicksort – Step 6
Thread 1 sorts the rest of its data
Thread 2 gets more work by stealing from Thread 1
[Figure: same partition around 27; the two halves are handled by Thread 1 and Thread 2]
Quicksort – Step 7
Thread 2 sorts the rest of its data
[Figure: all values 0 to 63 in sorted order]
DONE

Task Scheduler

Task Based Approach
Intel® TBB provides C++ constructs that allow you to express parallel solutions in terms of task objects
• Task scheduler manages thread pool
• Task scheduler avoids common performance problems of programming with threads

    Problem             Intel® TBB Approach
    Oversubscription    One scheduler thread per hardware thread
    Fair scheduling     Non-preemptive unfair scheduling
    High overhead       Programmer specifies tasks, not threads
    Load imbalance      Work-stealing balances load
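The "unfair" scheduling and work stealing in the table can be pictured with a double-ended queue per thread. The sketch below is a single-threaded illustration with hypothetical names (TaskDequeSketch, spawn, take_newest, steal_oldest); it has no synchronization and is not the scheduler's actual data structure. It assumes the owner works on its newest task while a thief takes the oldest one.

```cpp
#include <cassert>
#include <deque>

// Hypothetical per-thread task pool; tasks are represented by plain ints.
class TaskDequeSketch {
    std::deque<int> tasks_;
public:
    // Owner adds newly created work at the back.
    void spawn(int task) { tasks_.push_back(task); }
    bool empty() const { return tasks_.empty(); }

    // Owner takes the most recently spawned task, whose data is the most
    // likely to still be in cache (this is the "unfair" part).
    int take_newest() {
        assert(!tasks_.empty());
        int t = tasks_.back();
        tasks_.pop_back();
        return t;
    }

    // A thief takes the oldest task from the other end; with recursive
    // splitting this tends to be a large piece of work.
    int steal_oldest() {
        assert(!tasks_.empty());
        int t = tasks_.front();
        tasks_.pop_front();
        return t;
    }
};
```

With tasks 1, 2, 3 spawned in that order, a thief receives task 1 while the owner continues with task 3 and then task 2.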
Example: Naive Fibonacci Calculation
Recursion typically used to calculate Fibonacci number
Widely used as toy benchmark
• Easy to code
• Has unbalanced task graph

    long SerialFib(long n) {
        if (n < 2)
            return n;
        else
            return SerialFib(n-1) + SerialFib(n-2);
    }

Example: Naive Fibonacci Calculation
Can envision Fibonacci computation as a task graph
[Figure: call tree of SerialFib(4), which calls SerialFib(3) and SerialFib(2); each call recurses down to SerialFib(1) and SerialFib(0)]

Fibonacci - Task Spawning Solution
Use TBB tasks to thread creation and execution of the task graph
Create new root task
• Allocate task object
• Construct task
Spawn (execute) task, wait for completion

    long ParallelFib(long n) {
        long sum;
        FibTask& a = *new(task::allocate_root()) FibTask(n, &sum);
        task::spawn_root_and_wait(a);
        return sum;
    }
Fibonacci - Task Spawning Solution
Derived from TBB task class
The execute method does the computation of a task

    class FibTask: public task {
    public:
        const long n;
        long* const sum;
        FibTask(long n_, long* sum_) :
            n(n_), sum(sum_)
        {}
        task* execute() {   // Overrides virtual function task::execute
            // Assumed completion from here on: base case matching SerialFib,
            // then two child tasks in the blocking style.
            if (n < 2) {
                *sum = n;
            } else {
                long x, y;
                FibTask& a = *new(allocate_child()) FibTask(n-1, &x);
                FibTask& b = *new(allocate_child()) FibTask(n-2, &y);
                set_ref_count(3);            // two children plus one for the wait
                spawn(b);
                spawn_and_wait_for_all(a);
                *sum = x + y;
            }
            return NULL;
        }
    };

Further Optimizations Enabled by Scheduler
Recycle tasks
• Avoid overhead of allocating/freeing Task
• Avoid copying data and rerunning constructors/destructors
Continuation passing
• Instead of blocking, parent specifies another Task that will continue its work when children are done.
• Further reduces stack space and enables bypassing scheduler
Bypassing scheduler
• Task can return pointer to next Task to execute
  – For example, parent returns pointer to its left child
  – See include/tbb/parallel_for.h for example
• Saves push/pop on deque (and locking/unlocking it)

Activity #3: Task Scheduler Interface for Recursive Algorithms
Develop code to launch Intel TBB tasks to build and traverse a binary tree.
Generic Highly Concurrent Containers

Concurrent Containers
TBB Library provides highly concurrent containers
• STL containers are not concurrency-friendly: attempting to modify them concurrently can corrupt the container
• Standard practice is to wrap a lock around STL containers
– Turns container into serial bottleneck
Library provides fine-grained locking or lockless implementations
• Worse single-thread performance, but better scalability.
• Can be used with the library, OpenMP, or native threads.

Concurrency-Friendly Interfaces
Some STL interfaces are inherently not concurrency-friendly
For example, suppose two threads each execute:

    extern std::queue<T> q;
    if( !q.empty() ) {
        // At this instant, another thread might pop last element.
        item = q.front();
        q.pop();
    }

Solution: concurrent_queue has pop_if_present

Concurrent Queue Container
concurrent_queue<T>
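The interface problem above is that "test for empty" and "pop" are separate calls, so another thread can act between them. A minimal sketch of the fix, assuming a hypothetical LockedQueue wrapper built only from the standard library (it is not a TBB class), is to offer a single operation that tests and pops under one lock, which is the role pop_if_present plays for concurrent_queue:

```cpp
#include <mutex>
#include <queue>

// Hypothetical wrapper (not a TBB class): one lock guards every operation,
// and try_pop performs the emptiness test and the pop as a single step.
template <typename T>
class LockedQueue {
public:
    void push(const T& value) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push(value);
    }

    // Returns true and stores the front element in out if the queue was
    // non-empty; returns false without blocking otherwise.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty())
            return false;
        out = items_.front();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<T> items_;
};
```

This removes the race in the interface, but it is still the "wrap a lock around the container" approach, so it serializes all access. The sketch shows the interface shape only, not the scalability of a concurrent container.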
Concurrent Queue Container Example
Simple example to enqueue and print

    #include "tbb/concurrent_queue.h"
    #include <stdio.h>

    int main () {
        tbb::concurrent_queue<int> queue;
        int j;
        for (int i = 0; i < 10; i++)        // Push items onto queue
            queue.push(i);
        while (!queue.empty()) {            // While more things on queue
            queue.pop(j);                   //   pop item off
            printf("from queue: %d\n", j);  //   print item
        }
        return 0;
    }

Concurrent Vector Container
concurrent_vector<T>

Concurrent Vector Container Example

    void Append( concurrent_vector …   // listing truncated in the source

Append a string to the array of characters held in concurrent_vector
Grow the vector to accommodate new string
• grow_by() returns old size of vector (first index of new element)
Copy string into vector
Concurrent Hash Table Container
concurrent_hash_map

Concurrent Hash Table Container Example
MyHashCompare methods
User-defined method hash() takes a string as a key and maps to an integer
User-defined method equal() returns true if two strings are equal

    struct MyHashCompare {
        static size_t hash( const string& x ) {
            size_t h = 0;
            for( const char* s = x.c_str(); *s; s++ )
                h = (h*157)^*s;
            return h;
        }
        static bool equal( const string& x, const string& y ) {
            return x == y;
        }
    };

Concurrent Hash Table Container Example
Key Insert

    typedef concurrent_hash_map …   // listing truncated in the source

If insert() returns true, new string insertion
• Value is key's place within sequence of strings from getNextString()
Otherwise, string has been previously seen
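Because the insert listing is cut off, the following sketch shows only the behaviour the slide describes: insert reports whether a string is new, and the stored value is the string's place in the input sequence. It is an assumption-laden analogue, not the TBB code. StringHash and StringTable are hypothetical names, the table is a std::unordered_map behind a single std::mutex rather than a concurrent_hash_map, and accessors are replaced by a plain find that copies the value out.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Same hashing scheme as MyHashCompare::hash on the slide.
struct StringHash {
    std::size_t operator()(const std::string& x) const {
        std::size_t h = 0;
        for (const char* s = x.c_str(); *s; ++s)
            h = (h * 157) ^ *s;
        return h;
    }
};

// Hypothetical table (not a TBB class): records, for each distinct string,
// its position in the sequence of strings passed to insert().
class StringTable {
public:
    // Returns true if key was newly inserted, false if it was seen before.
    bool insert(const std::string& key) {
        std::lock_guard<std::mutex> guard(mutex_);
        int position = next_position_++;
        return map_.emplace(key, position).second;
    }

    // Returns true and sets position if key is present in the table.
    bool find(const std::string& key, int& position) const {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = map_.find(key);
        if (it == map_.end())
            return false;
        position = it->second;
        return true;
    }

private:
    mutable std::mutex mutex_;
    int next_position_ = 0;
    std::unordered_map<std::string, int, StringHash> map_;
};
```

With the input sequence "a", "b", "a", "c", the third insert returns false and "c" is recorded at position 3.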
Concurrent Hash Table Container Example
Key Find

    myHash table;
    string s1, s2;
    int p1, p2;
    …
    {
        myHash::const_accessor a;   // read_lock
        myHash::const_accessor b;
        if (table.find(a,s1) && table.find(b,s2)) {   // find strings
            p1 = a->second;
            p2 = b->second;
            if (p1 …                // listing truncated in the source

If find() returns true, key was found within hash table

Activity 4: Concurrent Container Lab
Use a hash table to keep track of the number of string occurrences.

Scalable Memory Allocation

Scalable Memory Allocators
Serial memory allocation can easily become a bottleneck in multithreaded applications
• Threads require mutual exclusion into shared heap
False sharing - threads accessing the same cache line
• Even accessing distinct locations, cache line can ping-pong
Intel® Threading Building Blocks offers two choices for scalable memory allocation
• Similar to the STL template class std::allocator
• scalable_allocator
– Offers scalability, but not protection from false sharing
– Memory is returned to each thread from a separate pool
• cache_aligned_allocator
– Offers both scalability and false sharing protection
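False sharing can be illustrated without any allocator. The sketch below assumes a 64-byte cache line (common, but not universal) and C++17, and uses the hypothetical names kCacheLine, PaddedCounter and CountInParallel. Aligning each counter to a full cache line gives every thread its own line, which is the kind of protection cache_aligned_allocator is described as providing for allocated objects.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

// Each counter is aligned (and therefore padded) to a full cache line, so
// neighbouring counters in an array never share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

// Hypothetical driver: every thread increments only its own counter, then
// the per-thread counts are summed after all threads have joined.
long CountInParallel(unsigned num_threads, long increments_per_thread) {
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (std::thread& w : workers)
        w.join();
    long total = 0;
    for (const PaddedCounter& c : counters)
        total += c.value.load();
    return total;
}
```

Removing the alignas specifier keeps the result identical but packs several counters into one cache line, which is the ping-pong situation the slide warns about.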
Methods for scalable_allocator

    #include "tbb/scalable_allocator.h"
    template<typename T> class scalable_allocator;

STL allocator functionality
• T* A::allocate( size_type n, void* hint=0 )
– Allocate space for n values
• void A::deallocate( T* p, size_t n )
– Deallocate n values from p
• void A::construct( T* p, const T& value )
• void A::destroy( T* p )

Scalable Allocators Example
Use TBB scalable allocator for STL basic_string class

    #include "tbb/scalable_allocator.h"
    typedef char _Elem;
    typedef std::basic_string<_Elem, std::char_traits …   // listing truncated in the source

Activity 5: Memory Allocation Comparison
Do a scalable memory exercise that first uses "new" and then asks user to replace with TBB scalable allocator
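To show how an allocator with the interface listed above plugs into std::basic_string, here is a minimal sketch that needs no TBB installation. MallocAllocator and MallocString are hypothetical names, and the allocator simply forwards to malloc/free, so it demonstrates the interface shape rather than any scalability benefit. Since C++11, std::allocator_traits supplies default construct and destroy, so only allocate and deallocate are written out.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>

// Hypothetical minimal allocator backed by malloc/free. It stands in for
// tbb::scalable_allocator only to show the shape of the interface.
template <typename T>
struct MallocAllocator {
    using value_type = T;

    MallocAllocator() = default;
    template <typename U>
    MallocAllocator(const MallocAllocator<U>&) noexcept {}

    // Allocate uninitialized space for n values of type T.
    T* allocate(std::size_t n) {
        void* p = std::malloc(n ? n * sizeof(T) : 1);
        if (!p)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    // Release space previously returned by allocate().
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// All instances are interchangeable because the allocator has no state.
template <typename T, typename U>
bool operator==(const MallocAllocator<T>&, const MallocAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const MallocAllocator<T>&, const MallocAllocator<U>&) { return false; }

// A string type whose character storage comes from the custom allocator.
using MallocString =
    std::basic_string<char, std::char_traits<char>, MallocAllocator<char>>;
```

Swapping MallocAllocator<char> for a different allocator type in the MallocString alias is the only change needed at the point of use, which is the pattern the truncated typedef on the slide appears to follow.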
Low-Level Synchronization Primitives

Intel® TBB: Synchronization Primitives
Parallel tasks must sometimes touch shared data
• When data updates might overlap, use mutual exclusion to avoid race
High-level generic abstraction for HW atomic operations
• Atomically protect update of single variable
Critical regions of code are protected by scoped locks
• The range of the lock is determined by its lifetime (scope)
• Leaving lock scope calls the destructor, making it exception safe
• Minimizing lock lifetime avoids possible contention
• Several mutex behaviors are available
– Spin vs. queued ("are we there yet" vs. "wake me when we get there")
– Writer vs. reader/writer (supports multiple readers/single writer)
– Scoped wrapper of native mutual exclusion function

Atomic Execution
atomic<T>

    '= x' and 'x ='              read/write value of x
    x.fetch_and_store (y)        z = x, x = y, return z
    x.fetch_and_add (y)          z = x, x += y, return z
    x.compare_and_swap (y,p)     z = x, if (x==p) x=y; return z
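The three read-modify-write operations in the table have close counterparts in std::atomic, which makes their semantics easy to check without TBB. The wrapper names below (FetchAndStore, FetchAndAdd, CompareAndSwap) are hypothetical and chosen to mirror the slide; each returns the old value z.

```cpp
#include <atomic>

// x.fetch_and_store(y): z = x, x = y, return z
inline int FetchAndStore(std::atomic<int>& x, int y) {
    return x.exchange(y);
}

// x.fetch_and_add(y): z = x, x += y, return z
inline int FetchAndAdd(std::atomic<int>& x, int y) {
    return x.fetch_add(y);
}

// x.compare_and_swap(y, p): z = x, if (x == p) x = y; return z
inline int CompareAndSwap(std::atomic<int>& x, int y, int p) {
    int expected = p;
    // On failure, compare_exchange_strong writes the current value of x into
    // expected; on success, expected still equals p, which was the old value.
    x.compare_exchange_strong(expected, y);
    return expected;
}
```

Note the argument order: as on the slide, CompareAndSwap takes the new value first and the comparand second, whereas compare_exchange_strong takes the comparand first.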
Mutex Concepts
Mutexes are C++ objects based on scoped locking pattern
Combined with locks, provide mutual exclusion

    M()                              Construct unlocked mutex
    ~M()                             Destroy unlocked mutex
    typename M::scoped_lock          Corresponding scoped_lock type
    M::scoped_lock ()                Construct lock w/out acquiring a mutex
    M::scoped_lock (M&)              Construct lock and acquire lock on mutex
    M::~scoped_lock ()               Release lock if acquired
    M::scoped_lock::acquire (M&)     Acquire lock on mutex
    M::scoped_lock::release ()       Release lock

Mutex Flavors
spin_mutex
• Non-reentrant, unfair, spins in the user space
• VERY FAST in lightly contended situations; use if you need to protect very few instructions
queuing_mutex
• Non-reentrant, fair, spins in the user space
• Use queuing_mutex when scalability and fairness are important
queuing_rw_mutex
• Non-reentrant, fair, spins in the user space
spin_rw_mutex
• Non-reentrant, unfair, spins in the user space
• Use a reader-writer mutex to allow non-blocking read for multiple threads
mutex
• Wrapper for OS sync: CRITICAL_SECTION for Windows*, pthread_mutex on Linux*
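The scoped locking pattern can be tried out with the standard library, where std::lock_guard plays the role of M::scoped_lock(M&). This is a sketch under that substitution, not TBB code; g_counter_mutex, g_counter, AddToCounter and RunScopedLockDemo are hypothetical names. The exception path shows why releasing the lock in the destructor makes the pattern exception safe.

```cpp
#include <mutex>
#include <stdexcept>
#include <thread>
#include <vector>

std::mutex g_counter_mutex;  // plays the role of the mutex M
long g_counter = 0;          // shared data protected by g_counter_mutex

// Adds delta to the shared counter. Throws on negative input to show that
// the lock is still released when the scope is left by an exception.
void AddToCounter(long delta) {
    std::lock_guard<std::mutex> lock(g_counter_mutex);  // constructor acquires
    if (delta < 0)
        throw std::invalid_argument("delta must be non-negative");
    g_counter += delta;
}   // destructor of lock releases the mutex on every exit path

// Hypothetical driver: several threads update the counter concurrently.
long RunScopedLockDemo(int num_threads, int adds_per_thread) {
    g_counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([adds_per_thread] {
            for (int i = 0; i < adds_per_thread; ++i)
                AddToCounter(1);
        });
    for (std::thread& w : workers)
        w.join();
    return g_counter;
}
```

If the lock were acquired and released by hand, the throwing path in AddToCounter would leave the mutex locked and the next call would block forever.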
Reader-Writer Lock Example

    #include "tbb/spin_rw_mutex.h"
    using namespace tbb;

    spin_rw_mutex MyMutex;

    int foo () {
        /* Construction of 'lock' acquires 'MyMutex' */
        spin_rw_mutex::scoped_lock lock (MyMutex, /*is_writer*/ false);
        read_shared_data (data);
        if (!lock.upgrade_to_writer ()) {
            /* lock was released to upgrade;
               may be unsafe to access data, recheck status before use */
        }
        else {
            /* lock was not released; no other writer was given access */
        }
        return 0;
        /* Destructor of 'lock' releases 'MyMutex' */
    }

One last question…
How do I know how many threads are available?
Do not ask!
• Not even the scheduler knows how many threads really are available
– There may be other processes running on the machine
• Routine may be nested inside other parallel routines
Focus on dividing your program into tasks of sufficient size
• Task should be big enough to amortize scheduler overhead
• Choose decompositions with good depth-first cache locality and potential breadth-first parallelism
Let the scheduler do the mapping
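The advice to divide work into tasks of sufficient size, without asking for a thread count, can be sketched as a recursive split with a grain size. The block below is an assumption-based analogue using std::async, not a TBB algorithm; ParallelSum and grain are hypothetical names. Note that std::async may start a thread per task, whereas a TBB scheduler maps tasks onto its own worker pool.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums data[begin, end). Ranges longer than grain are
// split in half; ranges at or below grain are summed serially, so each
// task is large enough to amortize the cost of creating it.
long long ParallelSum(const std::vector<int>& data, std::size_t begin,
                      std::size_t end, std::size_t grain) {
    std::size_t length = end - begin;
    if (length <= grain || length < 2) {
        auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
        auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
        return std::accumulate(first, last, 0LL);
    }
    std::size_t mid = begin + length / 2;
    // Left half runs as a separate task; right half runs in this thread.
    std::future<long long> left = std::async(
        std::launch::async, ParallelSum, std::cref(data), begin, mid, grain);
    long long right = ParallelSum(data, mid, end, grain);
    return left.get() + right;
}
```

The caller chooses only the grain size; the number of tasks follows from the input length, and the code never queries how many threads exist.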
Review
• Intel® Threading Building Blocks is a parallel programming model for C++ applications
– Used for computationally intense code
– A focus on data parallel programming
– Uses generic programming
• Intel® Threading Building Blocks provides
– Generic parallel algorithms
– Highly concurrent containers
– Low-level synchronization primitives
– A task scheduler that can be used directly
• Know when to select Intel® Threading Building Blocks, the OpenMP API or Explicit Threading