Session: Threading for Performance with Intel® Threading Building Blocks

INTEL CONFIDENTIAL Objjec tives

At theee en dod of thi ssessos session ,you, you will b eae abl e t o:

• List the different components of the Intel Threading Building Blocks library

• Code parallel applications using TBB

SOFTWARE AND SERVICES 2 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. AdAggenda

Intel®®adgudgobagoud Threading Building Blocks background Generic Parallel Algorithms pp_arallel_for parallel_reduce paralle l_sort example Task Scheduler GiGeneric HihlHighly Concurren t CtiContainers Scalable Memory Allocation Low-level Synchronization Primitives

SOFTWARE AND SERVICES 3 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Intel®®gg() Threading Building Blocks (TBB) -Key Features

You sp ecify tasks ((aauouwhat can run concurrently) y)ado instead of s • Library maps your logical tasks onto physical threads, efficiently using cache and balancing load • Full support for nested paralle lism Targets threading for scalable performance • Portable across Linux*, Mac OS*, Windows*, and Solaris* Emphasizes scalable data parallel programming • Loop parallelism tasks are more scalable than a fixed number of separate tasks Compatible with other thre ading pack age s • Can be used in concert with native threads and OpenMP

Open source and lilidcensed versions availilblable

SOFTWARE AND SERVICES 4 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Limit ati ons

Intel® TBB is not intended for • I/O bound processing • Real-time processing

SOFTWARE AND SERVICES 5 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. CtComponents off TBB (i2)(version 2.1)

Parallel algorithms paralle l_ for (improve d) parallel_reduce (improved) Concurrent containers para lle l_do ((enew) concurrent_hash _map pipeline (improved) concurrent_queue parallel_ sort concurrent_vector paralle l_scan (all improved) Taskhdlk scheduler With new functionality Synchronization primitives Utilities atomic operations tick_count various flavors of mutexes (improved) tbb_thread (new) Memory alllltocators tbb_allocator (()new),,_g_, cache_aligned_allocator, scalable_ allocator

SOFTWARE AND SERVICES 6 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. TkTask-base d Progggrammi ng with In te l® ® TBB

Taaagsks are light-weiggauht entities at user-level • Intel® TBB parallel algorithms map tasks onto threads automatically • Tas k sched ul er manages the thread pool – Scheduler is unfair to favor tasks that have been most recent in the cache • Oversubscription and undersubscription of core resources is prevented by task-stealing technique of TBB scheduler

SOFTWARE AND SERVICES 7 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. GiGeneric ParallelPlll AlgorithmsAlith

INTEL CONFIDENTIAL GiGeneric Progggrammi ng

Best knooapwn example is C++ STL Enables distribution of broadly-useful high-quality algorithms and data stttructures Write best possible algorithm with fewest constraints • Do not force particular data structure on user • Class ic examp le: STL st d::sort Instantiate algorithm to specific situation • C++ template instantiation, partial specialization, and inlining make resulting code efficient

Standard Template Library, overall, is not thread-safe

SOFTWARE AND SERVICES 9 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. GiGeneric Progggrammi ng - ElExamplp e

The coopmpiler crea tes the need dded versio ns

T must define a copy constructor and a destructor

temppyplate T max ((,y){T x, T y) { if (x < y) return y; return x; } T must define operator< int main() { inti=int i = max(20,5); dblfdouble f = max(2.5 , 5 .2) ; MClMyClass m = max(My Class (“foo ”), My Class (“bar ”)); return 0; }

SOFTWARE AND SERVICES 10 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Int el® ®gg Threa ding Bu ilding Bloc ks Pa tterns

Task scheduler powers high level parallel patterns that are pre- packaged, tested, and tuned for scalability • parallel_for: load-balanced parallel execution of loop iterations where iterations are independent • parallel_reduce : load-balanced parallel execution of independent loop iterations that perform reduction (e.g . summation of array elements) • parallel_ while: load-balanced parallel execution of independent loop itera tions w ith un known or dynami call y ch angi ng bound s (e.g. appl yi ng functi on to the el ement of link ed list) • pp_arallel_scan: temppppplate function that computes parallel prefix • pipeline: data-flow pipeline pattern • parallel_sort : parallel sort

SOFTWARE AND SERVICES 11 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. GiGrain Size

OpenMP haaapaas similar parameter Part of parallel_for , not underlying task scheduler • Grain size exists to amortize overhead, not balance load • Units of granularity are loop iterations Typically only need to get it right within an order of magnitude

SOFTWARE AND SERVICES 12 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. TiTuning GGirain Size

too fine  too coarse  schdliheduling overhead ddiomina tes lose poten tia l para lle lism

• Tune byygg examining single-ppprocessor performance • When in doubt, err on the side of making it a little too large, so that performance is not hurt when only one core is available.

SOFTWARE AND SERVICES 13 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. The paralle llf_ for TlTemplp a te

template void parallel_ for(const Range& range, const Body &body);

Requires definition of: • A range type to iterat e over – Must define a copy constructor and a destructor – Defines is_empty () – Defines is_divisible () – Defines a splitting constructor, R(R &r, split) •A body type t hat ope rates o n t he ra nge (o r a subr ange ) – Must define a copy constructor and a destructor – DfiDefines operat()tor()

SOFTWARE AND SERVICES 14 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. BdBodyy is Generi c

Requirements fo r para llel_ foro Body

Body::Body(const Body&) Copy constructor Bodyyy()::~Body() Destructor

void Body::operator() (Range& subrange) const Apply the body to subrange. paralle l_ for partitions ori g ina l range in to su branges, an d dea ls ou t subranggyes to worker threads in a way that: • Balances load • Uses cache efficiently •Scales

SOFTWARE AND SERVICES 15 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Range is GGieneric

Requirements fo r para llel_ foro Raagnge

R::R (const R&) Copy constructor R::~R() Destructor bool R::is_empty() const True if range is empty bool R::is_divisible() const True if range can be partitioned R::R (R& r, split) Splitting constructor; splits r into two subranges

Library provides predefined ranges • blocked_range and blocked_range2d You can define your own ranges

SOFTWARE AND SERVICES 16 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. How splittipglitting work s on block ed _grange2d

Split range...

.. recursively...

tasks available to be ...until  grainsize. sche du le d to o ther threa ds (()thieves)

SOFTWARE AND SERVICES 17 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. An Examplpgp le us ing para lllfllelf_ or

Inddpepen dent iter aoations a adnd fixed/ d/oknown bou boudnds

const int N = 100000;

void change_array(float change array(float array, int M) { for(inti=0;i

int main (){ float A[N]; initialize_array(A); changg_e arra y(A, N); return 0; }

SOFTWARE AND SERVICES 18 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. An Examplpgp le us ing para lllfllelf_ or

Include a nd initia lize the lib ra ry

Include Library Headers int#include main “tbb/task(){ _scheduler_init.h” #includefloat “tbb/blockedA[N]; _range.h” #includeinitialize “tbb/_y()pp_ arraarallely(A); for.h” changg_e arra y(,y(A, N ); usinreturnggp names 0;p ace tbb; } int main (){ Use namespace tasktask_scheduler_init scheduler init init; float A[N]; initializeinitialize_array(A); array(A); parallelparallel_change_array(A change array(A, N); ret0turn 0; } blue = original code Initialize scheduler green = provided by TBB red = boilerplate for library

SOFTWARE AND SERVICES 19 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. An Examplpgp le us ing para lllfllelf_ or

blue = or ig ina l cod e Use the pp_arallel for pattern green = provid e d by TBB red = boiler plate for libr ary

class ChangeArrayBody { void changg_earra y(float *arra y, int M) { float *array; for ((;;){int i = 0; i < M; i++){ public: Define Task arr ay[i ] *= 2 ; Change Array BdBody (flt*)(float *a): array (){}(a) {} } void operator()( const blocked_range & r ) const{ } for (int i = r.begin(); i != r.end(); i++ ){ array[i] *= 2; } } Use Pattern }; EtblihEstablish grain size

void parallel_change_array(float parallel change array(float *array, int M) { parallelparallel_for for (blockedblocked_range range (0, M, IdealGrainSize), ChangeArrayBody(array)); }

SOFTWARE AND SERVICES 20 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Ac tivity ityp 1: Mat ri x Multipli lti licati on

Convert serial matrix multiplication application into parallel application using parallel_ for • Triple-nestedld loops

SOFTWARE AND SERVICES 21 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. The paralle l_re duce TlTemplp a te

template void parallel_reduce (const Range& range, Body &body);

Requirements for parallel_reduce Body

BdBody::B Bd(ody( const BBd&ody&, split ) Splitti ng construc tor

BdBody::~B Bd()ody() DtDestruct or void Body::operator() (Range& subrange) const Accumulate results from subrange

void Body::join( Body& rhs ); Merge result of rhs into the result of this.

Reuses Range concept from parallel_ for

SOFTWARE AND SERVICES 22 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. SilSerial Examplp le

// Find index of smallest element in a[0... n-1] long SerialMinIndex ( const float a[], sizsize_tetn){ n ) { float valuevalue_of_min of min = FLTFLT_MAX; MAX; long indexindex_of_min of min = -1; for( sizesize_t t i=0; i

SOFTWARE AND SERVICES 23 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Parall el Versi on ()(1 o f 3)

class MinIndexBody { const float *const mymy_a; a; public: float value_of_min; On next slides: long index_of_min; ... operator() MidinIndexBo d(dy ( const fl[])float a[] ) : Sp lit construct or mymy_a(a) a(a), valuevalue_of_min(FLT_MAX) of min(FLT MAX), join indexindex_of_min( of min(-1) {} };

// Find index of smallest element in a[0...n-1] long ParallelMinIndex ( const float a[], size_t n ) { MiIdMinIndex BdBody m ib()ib(a); parallelparallel_reduce(blocked_range reduce(blocked range(0 t>(0,n ,GrainSize) , mib ); return mib .index_of_min; index of min; } blue = original code green = provided by TBB red = boilerplate for library

SOFTWARE AND SERVICES 24 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Parall el Versi on ()(2 o f 3)

class MinIndexBody { const float *const my_a; my a; public: accumulate result ...... void operator ()( const blocked_g range& r ) { const float* a = myy_a; for((_ size t i = r.begin(); i != r.end; ++i )){ { float value = a[i]; if( value

blue = original code green = provided by TBB red = boilerplate for library

SOFTWARE AND SERVICES 25 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Parall el Versi on ()(3 o f 3)

class MinIndexBody { const float *const my_a; my a; public: . . . MiIdMinIndex Bd(MiIdBd&Body( MinIndexBody& x, split ) : split my_a()(x.my_a), valfi()lue_of_min(FLT_MAX), index_of_min(-1) {} void join( const MinIndexBody& y ) { if( yy__.value of min < value __ of min ) { join value__y__ of min = y.value of min; index__y__; of min = y.index of min; } } . . . blue = original code green = provided by TBB red = boilerplate for library

SOFTWARE AND SERVICES 26 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. AtiitActivityyp 2: parall lllel_red uce LLbab

Nuuagaoodoopumerical Integration code to compute Pi

SOFTWARE AND SERVICES 27 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. PlllParallel SSort EExample l (ih(with workk stealing)li)

INTEL CONFIDENTIAL QikQuicksort – Step 1

tbb::parallel_sort (color, color+64);

THREAD 1

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

Thread 1 starts with the initial data

SOFTWARE AND SERVICES 29 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. QikQuicksort – Step 2

THREAD 1 THREAD 3 THREAD 2 THREAD 4

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

11 0 9 26 31 30 3 19 12 29 27 1 20 5 33 4 25 21 7 52 47 41 43 53 60 61 38 56 48 59 54 50 37 15176181610223131482436322822343515 17 6 18 16 10 2 23 13 14 8 24 36 32 28 22 34 35 49 51 45 62 39 42 40 58 55 57 44 46 63

Thread 1 partitions/ split s its dat a

37

SOFTWARE AND SERVICES 30 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. QikQuicksort – Step 2

THREAD 1 THREAD 3 THREAD 2 THREAD 4

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

11 0 9 26 31 30 3 19 12 29 27 1 20 5 33 4 25 21 7 52 47 41 43 53 60 61 38 56 48 59 54 50 37 15176181610223131482436322822343515 17 6 18 16 10 2 23 13 14 8 24 36 32 28 22 34 35 49 51 45 62 39 42 40 58 55 57 44 46 63

Thread 2 gets work by stealing from Thread 1

37

SOFTWARE AND SERVICES 31 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. QikQuicksort – Step 3

THREAD 1 THREAD 2

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

11 0 9 26 31 30 3 19 12 29 27 1 20 5 33 4 25 21 7 52 47 41 43 53 60 61 38 56 48 59 54 50 37 15176181610223131482436322822343515 17 6 18 16 10 2 23 13 14 8 24 36 32 28 22 34 35 49 51 45 62 39 42 40 58 55 57 44 46 63

12 29 27 19 20 30 33 31 25 21 11 15 45 47 41 43 50 52 51 54 62 1 0 2 6 7 172618161092313148243617 26 18 16 10 9 23 13 14 8 24 36 46 44 40 38 49 59 56 61 58 55 4 5 3 32 28 22 34 35 42 48 39 57 60 53 63

Thread 1 Thread 2 par ti ti on s/spli ts i ts data pp/partitions/splits its data

7 37 49

SOFTWARE AND SERVICES 32 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. QikQuicksort – Step 3

THREAD 1 THREAD 3 THREAD 2 THREAD 4

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

11 0 9 26 31 30 3 19 12 29 27 1 20 5 33 4 25 21 7 52 47 41 43 53 60 61 38 56 48 59 54 50 37 15176181610223131482436322822343515 17 6 18 16 10 2 23 13 14 8 24 36 32 28 22 34 35 49 51 45 62 39 42 40 58 55 57 44 46 63

12 29 27 19 20 30 33 31 25 21 11 15 45 47 41 43 50 52 51 54 62 1 0 2 6 7 172618161092313148243617 26 18 16 10 9 23 13 14 8 24 36 46 44 40 38 49 59 56 61 58 55 4 5 3 32 28 22 34 35 42 48 39 57 60 53 63

Thread 3 gets work by Thread 4 gets work by stealing from Thread 1 stealing from Thread 2

7 37 49

SOFTWARE AND SERVICES 33 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. QikQuicksort – Step 4

THREAD 1 THREAD 3 THREAD 2 THREAD 4

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

11 0 9 26 31 30 3 19 12 29 27 1 20 5 33 4 25 21 7 52 47 41 43 53 60 61 38 56 48 59 54 50 37 15176181610223131482436322822343515 17 6 18 16 10 2 23 13 14 8 24 36 32 28 22 34 35 49 51 45 62 39 42 40 58 55 57 44 46 63

12 29 27 19 20 30 33 31 25 21 11 15 45 47 41 43 50 52 51 54 62 1 0 2 6 7 172618161092313148243617 26 18 16 10 9 23 13 14 8 24 36 46 44 40 38 49 59 56 61 58 55 4 5 3 32 28 22 34 35 42 48 39 57 60 53 63

11 8 14 13 21 25 26 31 33 30 9 10 16 12 18 20 23 19 27 29 24 17 15 36 32 28 22 34 35

Thread 1 sorts the Thread 3 partitions/splits Thread 2 sorts the Thread 4 sorts the rest of its data its data rest its data rest of its data

0 1 2 3 4 5 6 7 18 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

SOFTWARE AND SERVICES 34 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. QikQuicksort – Step 5

THREAD 1 THREAD 3 THREAD 2 THREAD 4

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

11 0 9 26 31 30 3 19 12 29 27 1 20 5 33 4 25 21 7 52 47 41 43 53 60 61 38 56 48 59 54 50 37 15176181610223131482436322822343515 17 6 18 16 10 2 23 13 14 8 24 36 32 28 22 34 35 49 51 45 62 39 42 40 58 55 57 44 46 63

12 29 27 19 20 30 33 31 25 21 11 15 45 47 41 43 50 52 51 54 62 1 0 2 6 7 172618161092313148243617 26 18 16 10 9 23 13 14 8 24 36 46 44 40 38 49 59 56 61 58 55 4 5 3 32 28 22 34 35 42 48 39 57 60 53 63

11 8 14 13 21 25 26 31 33 30 9 10 16 12 18 20 23 19 27 29 24 17 15 36 32 28 22 34 35

Thread 3 sorts the Thread 1 gets more rest of its data work by stealing from Thread 3

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

SOFTWARE AND SERVICES 35 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. QikQuicksort – Step 6

THREAD 1 THREAD 3 THREAD 2 THREAD 4

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

11 0 9 26 31 30 3 19 12 29 27 1 20 5 33 4 25 21 7 52 47 41 43 53 60 61 38 56 48 59 54 50 37 15176181610223131482436322822343515 17 6 18 16 10 2 23 13 14 8 24 36 32 28 22 34 35 49 51 45 62 39 42 40 58 55 57 44 46 63

12 29 27 19 20 30 33 31 25 21 11 15 45 47 41 43 50 52 51 54 62 1 0 2 6 7 172618161092313148243617 26 18 16 10 9 23 13 14 8 24 36 46 44 40 38 49 59 56 61 58 55 4 5 3 32 28 22 34 35 42 48 39 57 60 53 63

11 8 14 13 21 25 26 31 33 30 9 10 16 12 18 20 23 19 27 29 24 17 15 36 32 28 22 34 35

Thread 1 19 25 26 30 29 33 22 24 21 27 36 32 28 partitions/ split s its dat a 20 23 31 34 35

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 27 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

SOFTWARE AND SERVICES 36 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. QikQuicksort – Step 6

THREAD 1 THREAD 3 THREAD 2 THREAD 4

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

11 0 9 26 31 30 3 19 12 29 27 1 20 5 33 4 25 21 7 52 47 41 43 53 60 61 38 56 48 59 54 50 37 15176181610223131482436322822343515 17 6 18 16 10 2 23 13 14 8 24 36 32 28 22 34 35 49 51 45 62 39 42 40 58 55 57 44 46 63

12 29 27 19 20 30 33 31 25 21 11 15 45 47 41 43 50 52 51 54 62 1 0 2 6 7 172618161092313148243617 26 18 16 10 9 23 13 14 8 24 36 46 44 40 38 49 59 56 61 58 55 4 5 3 32 28 22 34 35 42 48 39 57 60 53 63

11 8 14 13 21 25 26 31 33 30 9 10 16 12 18 20 23 19 27 29 24 17 15 36 32 28 22 34 35

19 25 26 30 29 33 22 24 21 27 36 32 28 20 23 31 34 35 Thread 1 sorts the Thread 2 gets more work rest of its ddtata by stealing from Thread 1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 27 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

SOFTWARE AND SERVICES 37 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. QikQuicksort – Step 7

THREAD 1 THREAD 3 THREAD 2 THREAD 4

32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

11 0 9 26 31 30 3 19 12 29 27 1 20 5 33 4 25 21 7 52 47 41 43 53 60 61 38 56 48 59 54 50 37 15176181610223131482436322822343515 17 6 18 16 10 2 23 13 14 8 24 36 32 28 22 34 35 49 51 45 62 39 42 40 58 55 57 44 46 63

12 29 27 19 20 30 33 31 25 21 11 15 45 47 41 43 50 52 51 54 62 1 0 2 6 7 172618161092313148243617 26 18 16 10 9 23 13 14 8 24 36 46 44 40 38 49 59 56 61 58 55 4 5 3 32 28 22 34 35 42 48 39 57 60 53 63

11 8 14 13 21 25 26 31 33 30 9 10 16 12 18 20 23 19 27 29 24 17 15 36 32 28 22 34 35

19 25 26 30 29 33 22 24 21 27 36 32 28 20 23 31 34 35 Thread 2 sorts the rest of its data

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

DONE SOFTWARE AND SERVICES 38 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. TkTask SchedulerShdl

INTEL CONFIDENTIAL TkBdATask Based Apppproach

Intel® TBB ppypprovides C++ constructs that allow you to express parallel solutions in terms of task objects • Task scheduler manages thread pool • Task scheduler avoids common performance problems of programming with threads

Problem Intel® TBB Approach

ObitiOversubscription One sch ed ul er threa d per hard ware thread

Fair scheduling Non-preemptive unfair scheduling

High overhead Programmer specifies tasks, not threads

Load imbalance Work-stealing balances load

SOFTWARE AND SERVICES 40 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Examplp le: NiNaive Fibonacc i CClltialculation

Recuuorsion typ ypayudoauaboaically used to calculate Fibonacci nu ubmber Widely used as toy benchmark • Easy to code • Has unbalanced task graph

long SerialFib( long n ) { if( n<2 ) return n; else return SerialFib(n-1) + SerialFib(n-2); }

SOFTWARE AND SERVICES 41 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Examplp le: NiNaive Fibonacc i CClltialculation

Can envisio n Fibo na cci co mpu ta tio n a s a ta sk g rap h

SerialFib(4)

SerialFib(3) SerialFib(2)

SerialFib(3) SerialFib(2)

SilFib(2)SerialFib(2) SilFib(1)SerialFib(1) SilFib(1)SerialFib(1) SilFib(0)SerialFib(0)

SerialFib(2) SerialFib(1) SerialFib(1) SerialFib(0)

SerialFib(1) SerialFib(0)

SilFib(1)SerialFib(1) SilFib(0)SerialFib(0)

SOFTWARE AND SERVICES 42 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Fibonacc i - TkTask SpSpgawni ng SSltiolution

UUaoadaoaduooagapse TBB tasks to thread creation and execution of task graph

Create new root task • Allocate task object • Construct task Spp(awn (execute ),) task, wait for com pletion

long ParallelFib((g){ long n ) { long sum; FibTask& a = *new(Task::allocate_ root()) FibTask(n,&sum); Task::spawn____root_and_wait(a); return sum; }

SOFTWARE AND SERVICES 43 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Fibonacc i - TkTask SpSpgawni ng SSltiolution

Derived from TBB tktask class class FibTask: pu blic task { public: const long n; long* const sum; FibTask( long nThe___, long* execute summethod_ ) : does n(n_), sum(sum_)the computation of a task {} tas k* execu t(){te() { //O// Overr ides v itirtua lfl func tion tktask::execu te if( n

SOFTWARE AND SERVICES 44 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. FthFurther OtiitiOptipymizations Enabl blded byb SShdlcheduler

Recyyacle tasks • Avoid overhead of allocating/freeing Task • AidAvoid copying dtdata and rerunning construct ors/d /dtestruct ors Coouaopagntinuation passing • Instead of blocking, parent specifies another Task that will continue its work when children are done. • Furth er reduces stack space and enabl es bypassi ng sched ul er Bypassing scheduler • Task can return pointer to next Task to execute – For example, parent returns pointer to its llfeft child – See include/tbb/pp_arallel_for.h for example • Saves push/pop on deque (and locking/unlocking it)

SOFTWARE AND SERVICES 45 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Activityyg #3: Task Scheduler Interface for Recursive Algorithms

Develop code to launch Intel TBB tasks to build and traverse a binary tree.

SOFTWARE AND SERVICES 46 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. GGeneric i HighlyHi hl ConcurrentC t CtiContainers

INTEL CONFIDENTIAL CtConcurrent CCtiontainers

TBB Library provides highly concurrent containers • STL containers are not concurrency-friendly: attempt to modify them concurrently can corrupt contai ner • Standard practice is to wrap a around STL containers – Turns container into serial bottleneck Library provides fine-grained locking or lockless implementations • Worse single-thread performance, but better scalability. • Can be used with the library, OpenMP, or native threads.

SOFTWARE AND SERVICES 48 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Concurrency-FiFrien dldlyy Int erf aces

Sooaayoouyme STL interfaces are inherently not concurrency-frienddyly

For example, suppose two thhdreads each execute:

extern std::queue q;

if((q!q.em py()){pty()) { At this instant, another thread item=q.front(); might pop last element. q.pop(); }

Solution: concurrent_queue has pop_if _present

SOFTWARE AND SERVICES 49 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. CtConcurrent Queue CCtiontainer

coouncurrent_queu e • Preserves local FIFO order – If thread push es and anoth er thread pops two val ues, they come out in the same or der that they went in •Method pp(ush(const T& )pppylaces copy of item on back of q ueue • Two kinds of pops – Blocking – pop(T&) –non-blocking –pop_if_present(T&) • Method size() returns signed integer – If size() returns –n, it means n pops await corresponding pushes • MthdMethod empt()ty() returns si()ize() == 0 – Difference between ppppushes and pops – May return true if queue is empty, but there are pending pop()

SOFTWARE AND SERVICES 50 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. CtConcurrent QpQueue CCtiontainer EElxample

#include “tbb/concurrenttbb/concurrent_queue.h queue.h” Simple example to enqueue and print #include integers using namespace tbb;

int main () { concurrentconcurrent_queue queue queue; Constructor for queue int j;

for (int i = 0; i < 10; i++) Push items onto queue queue.push(i);

whilhile (! queue.empty()) { While more things on queue queue.pop(&j); printf(“ from queue: %d\n”, j); •Pop item off } • Print item return 0; }

SOFTWARE AND SERVICES 51 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. CtConcurrent VVtector CCtiontainer

coouncurrent_ vectoro • Dynamically growable array of T – MthdMethod grow_by(sitize_type dlt)delta) appends dltdelta eltlements to end of vector –Method gg___row_to_at_least(size__yptypen) adds elements until vector has at least n elements – Method size() returns the number of elements in the vector – MthdMethod empt()ty() returns si()ize() == 0 • Never moves elements until cleared – Can concurrently access and grow – Method clear() is not thread-safe with respect to access/resizing

SOFTWARE AND SERVICES 52 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. CtConcurrent VVtector CCtiontainer EElxamplp e

void Append( concurrentconcurrent_vector vector& V, const char* string) { sizesize_type type n = strlen(string)+1; memcpy( &V[V.growgrow_by(n) by(n)], string, n+1 ); }

Append a string to the array of characters held in concurrent_vector Grow the vect or to accommod dtate new st tiring • grow_ by() returns old size of vector ( first index of new element ) Copy string into vector

SOFTWARE AND SERVICES 53 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. CtConcurrent HHhash TTblable CCtiontainer

coouncurrent_ hash_ map • Maps Key to element of type T • You dfidefine cl ass HhCHashCompare with two meth od s –hash() mappys Key to hashcode of type size_t – equal() returns true if two Keys are equal • Enables concurrent find(), insert(), and erase() operations – find() and insert() set “smart pointer” that acts as lock on item – accessor grants read-write access – const_ accessor grants read-only access – loc k release d when smart poi nt er is dest royed

SOFTWARE AND SERVICES 54 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Concurrent Container Examppyle MyHashCom pare methods

User-dddefined meth od haa()sh() taaakes a string gaayadapoa as a key and maps to an integer User-dfiddefined meththdod equal() returns true if two st ri ngs are equal

struct Myyp{HashCompare { statstatcse_tas(coststic sizet hash( const strin g& x ){) { sizethsize_t h = 0; for( const char * s = x. cc_str(); str(); *s; s++ ) h = (h*157)^ *s; return h; } static bool equ al( const string& x, const string& y ) { return s t()0trcmp(x, y) == 0; } };

SOFTWARE AND SERVICES 55 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Concurrent Hash Table Container Examppyle Key Insert

typedef concurrent__p hash map myHash; myHash table; string newstring; int place = 0; … while (getNextString(&newString)) { myHash::accessor a; if (tbltable.itinsert( a, newStr ing )) // new s tr ing inser te d a->second = ++p lace; }

If insert() returns true, new string insertion • Value is key’s place within sequence of strings from getNextString() Otherwise, string has been previously seen

SOFTWARE AND SERVICES 56 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Concu rre nt Has h Tab l e Conta ine r Exa mp le Key Find

myHash table; string s1, s2; int p1, p2; … { myHash::constconst_accessor accessor a;//; // readread_lock lock myHash::constconst_accessor accessor b; if (table.find(a,s1)s1) && table.find(b,s2)) { // find strings p1 = a->second; p2b2 = b->second; if(1

If find() returns true, key was found within hash table

SOFTWARE AND SERVICES 57 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. AtiitActivityy 4: Concurrent Cont tiainer LLbab

Use a hash table to keep track of the number of string occurrences.

SOFTWARE AND SERVICES 58 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. SlblScalable MemoryM AllocationAllti

INTEL CONFIDENTIAL SlblScalable Memory AlltAllocators

Serial memoryyy allocation can easily become a bottleneck in multithreaded applications • Threads require mutual exclusion into shared heap False sharing - threads accessing the same cache line • Even accessing diidistinct lilocations, cache line can ping-pong

Intel® Threading Building Blocks offers two choices for scalable memory allocation • Similar to the STL template class std::allocator • scalbllltlable_allocator – Offers scalability, but not protection from false sharing – Memoryypp is returned to each thread from a separate pool • cachecache_aligned_allocator aligned allocator – Offers both scalability and false sharing protection

SOFTWARE AND SERVICES 60 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. MthdMethods for scalbllablea_ lloca tor

#include “tbb/scalable_allocator.h” template class scalablescalable_allocator allocator; Scalable versions of malloc, free, realloc, calloc • void *scalablescalable_malloc malloc( sizesize_t t size ); • void scalable_ free( void *ptr ); • void *scalable_realloc( void *ptr, size_t size ); • void *scalable_calloc( size_t nobj, size_t size );

STL allocator functionality • T* A::allocate( sizesize_type type n, void* hint= 00) ) – Allocate space for n values • void A:: deallocate(T*p( T* p, sizesize_t t n)n ) – Deallocate n values from p • void A::constttruct(T*( T* p, cons tT&l)t T& value ) • void A::destroy( T* p )

SOFTWARE AND SERVICES 61 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. SlblScalable AlltAllocators Examplp le

#inc lu de “tbb/sca la ble_a lloca tor. h” Use TBB scalable allocator typedef char _Elem; Elem; for STL basic_string class typedef std::basicstd::basic_string<_Elem string< Elem, stdhtittd::char_traits<_Elem>, tbb::scalable_allocator<_Elem>> MyString; . . . { Use TBB scalable allocator . . . to allocate 24 integers int *p; MyString str1 = "qwertyuiopasdfghjkl" ; MyString str2 = "asdfghjklasdfghjkl"; p = tbb::scalbllltlable_allocator().alllloca te(24); . . . }

SOFTWARE AND SERVICES 62 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. AtiitActivityyy 5: Memory Allocati on Comp ari son

Dooaab scalable memo oyry exercise tha t first u ses “new” a adnd then a sks user to replace with TBB scalable allocator

SOFTWARE AND SERVICES 63 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Low-Level Synchronization Primitives

INTEL CONFIDENTIAL Int el® ®y TBB: SSync hron iza tion Pri miti ves

Parallel tasks must sometimes touch shared data • When data updates might overlap, use mutual exclusion to avoid race

High-level generic abstraction for HW atomic operations • Atomically protect update of single variable Critical regions of code are protected by scoped locks • The range of the lock is determined by its lifetime (scope) • Leaving lock scope calls the destructor, making it exception safe • Mini mi z ing loc k lifeti me avoid s possibl e con ten tion • Several mutex behaviors are available –Sppqin vs. queued –“are we there yet” vs. “wake me when we get there”there – Writer vs. reader/writer (supports multiple readers/single writer) – Scoped wrapper of native mutual exclusion function

SOFTWARE AND SERVICES 65 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Atomi c Executi on

atomic • T should be integral type or pointer type • Full type-safe support for 8 , 16, 32, and 64-bit integers Operations

‘= x’ and ‘x = ’ read/write value of x xfetchandstorex.fetch_and_store (y) z = x, y = x, return z x.fetch_and_add (y) z = x, x += y, return z x.compare_and_swap (y,p) z = x, if (x==p) x=y; return z

atomic i; ...... int z = i.fetchfetch_and_add and add(2);

SOFTWARE AND SERVICES 66 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. MtMutex Conceptp ts

Mutexes are C++ obj ects ba sed o n scop ed lo cking pa ttern Combined with locks, provide mutual exclusion

M() Construct unlocked mutex

~M() Destroy unlocked mutex

typename MMdlk::scoped_lock CdiCorresponding scopedlkd_lock type

M::scoped_lock () Construct lock w/out acquiring a mutex

M::scoped_lock (M&) Construct lock and acquire lock on mutex

M::~scoped_lock () Release lock if acquired

MdlkiM::scoped_lock::acquire (M&) AiAcquire llkock on mut ex

M::scoped_lock::release () Release lock

SOFTWARE AND SERVICES 67 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. MtMutex Flavors

spin_mutex • Non-reentrant, unfair, spins in the user space • VERY FAST in liggyhtly contended situations; use if y ou need to p rotect very few instructions qqgueuing_mutex • Non-reentrant,,,p fair, spins in the user sp ace •Use QQg_ueuing_Mutex when scalabilityyp and fairness are important queuing_rw _mutex • Non-reentrant, fair, spins in the user space spin_ rw_mutex • Non-reentrant, fair, spins in the user space • Use ReaderWriterMutex to allow non-blocking read for multiple threads mtemutex • Wrapper for OS sync: CRITICAL_ SECTION for Wind ows* , pthread _mut ex on Linux *

SOFTWARE AND SERVICES 68 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Reader-Writer Lock Example

#include “tbb/spin_rw_mutex.h” using namespace tbb;

spin_rw_mutex MyMutex;

int foo (){ //q* Construction of ‘lock’ acquires ‘MyMutex’ */ spin_rw_mutex::scoped_lock lock (MyMutex, /*is_writer*/ false);

readread_shared_data shared data (data);

if (!lock.upgrade_to_writer ()) { /* lock was released to upgrade; may be un saf e to access data, r ech eck status bef or e use * / } else { /* lock was not released; no other writer was given access */ }

ret0turn 0; /* Destructor of ‘lock’ releases ‘MyMutex’ */ }

SOFTWARE AND SERVICES 69 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. One ltlast questi on…

How do I know how many threads are available?

Do not ask! • No t even the sch ed ul er knows how many thread s reall y are avail abl e –There mayyp be other processes running on the machine • Routine may be nested inside other parallel routines Focus on dividi ng your program in to tas ks o f su ffici en t si ze • Task should be biggg enough to amortize scheduler overhead • Choose decompositions with good depth-first cache locality and potential breadth-first parallelism Let the sch ed ul er do the mappi ng

SOFTWARE AND SERVICES 70 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. RiReview

Intel® Threading Building Blocks is a parallel programming model for C++ applications – Used for computationally intense code – A focus on data parallel programming – Uses generic programming • Int el® Threa ding Bu ilding Bloc ks prov ides – Generic ppgarallel algorithms – Highly concurrent containers – Low-level synchronization primitives – A task scheduler that can be used directly • Know when to select Intel®®gg, Threading Building Blocks, the OpenMP API or Explicit Threading

SOFTWARE AND SERVICES 71 Copyright © 2014, Intel Corporation. All rights reserved. Int el and the Int el logo are trad emark s or regi st ered trad emark s of Int el Corporati on or its sub sidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners.