Threading for Performance with Intel® Threading Building Blocks

Session: Threading for Performance with Intel® Threading Building Blocks INTEL CONFIDENTIAL Objjec tives At theee en dod of thi ssessos session ,you, you will b eae abl e t o: • List the different components of the Intel Threading Building Blocks library • Code parallel applications using TBB SOFTWARE AND SERVICES 2 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. AdAggenda Intel®®adgudgobagoud Threading Building Blocks background Generic Parallel Algorithms pp_arallel_for parallel_reduce paralle l_sort example Task Scheduler GiGeneric HihlHighly Concurren t CtiContainers Scalable Memory Allocation Low-level Synchronization Primitives SOFTWARE AND SERVICES 3 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Intel®®gg() Threading Building Blocks (TBB) -Key Features You sp ecify tasks ((aauouwhat can run concurrently) y)ado instead of thread s • Library maps your logical tasks onto physical threads, efficiently using cache and balancing load • Full support for nested paralle lism Targets threading for scalable performance • Portable across Linux*, Mac OS*, Windows*, and Solaris* Emphasizes scalable data parallel programming • Loop parallelism tasks are more scalable than a fixed number of separate tasks Compatible with other thre ading pack age s • Can be used in concert with native threads and OpenMP Open source and lilidcensed versions availilblable SOFTWARE AND SERVICES 4 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Limit ati ons Intel® TBB is not intended for • I/O bound processing • Real-time processing SOFTWARE AND SERVICES 5 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. CtComponents off TBB (i2)(version 2.1) Parallel algorithms paralle l_ for (improve d) parallel_reduce (improved) Concurrent containers para lle l_do ((enew) concurrent_hash _map pipeline (improved) concurrent_queue parallel_ sort concurrent_vector paralle l_scan (all improved) Taskhdlk scheduler With new functionality Synchronization primitives Utilities atomic operations tick_count various flavors of mutexes (improved) tbb_ thread (new) Memory alllltocators tbb_allocator (()new),,_g_, cache_aligned_allocator, scalable_ allocator SOFTWARE AND SERVICES 6 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. TkTask-base d Progggrammi ng with In te l® ® TBB Taaagsks are light-weiggauht entities at user-level • Intel® TBB parallel algorithms map tasks onto threads automatically • Tas k sched ul er manages the thread pool – Scheduler is unfair to favor tasks that have been most recent in the cache • Oversubscription and undersubscription of core resources is prevented by task-stealing technique of TBB scheduler SOFTWARE AND SERVICES 7 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. GiGeneric ParallelPlll AlgorithmsAlith INTEL CONFIDENTIAL GiGeneric Progggrammi ng Best knooapwn example is C++ STL Enables distribution of broadly-useful high-quality algorithms and data stttructures Write best possible algorithm with fewest constraints • Do not force particular data structure on user • Class ic examp le: STL st d::sort Instantiate algorithm to specific situation • C++ template instantiation, partial specialization, and inlining make resulting code efficient Standard Template Library, overall, is not thread-safe SOFTWARE AND SERVICES 9 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. GiGeneric Progggrammi ng - ElExamplp e The coopmpiler crea tes the need dded versio ns T must define a copy constructor and a destructor temppyplate <typename T> T max ((,y){T x, T y) { if (x < y) return y; return x; } T must define operator< int main() { inti=int i = max(20,5); dblfdouble f = max(2.5 , 5 .2) ; MClMyClass m = max(My Class (“foo ”), My Class (“bar ”)); return 0; } SOFTWARE AND SERVICES 10 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. In tel® ®gg Threa ding Bu ilding Bloc ks Pa tterns Task scheduler powers high level parallel patterns that are pre- packaged, tested, and tuned for scalability • parallel_for: load-balanced parallel execution of loop iterations where iterations are independent • parallel_reduce : load-balanced parallel execution of independent loop iterations that perform reduction (e.g . summation of array elements) • parallel_ while: load-balanced parallel execution of independent loop itera tions with unknown or dynami call y changi ng bound s (e.g. apply ing functi on to the element of link ed list) • pp_arallel_scan: temppppplate function that computes parallel prefix • pipeline: data-flow pipeline pattern • parallel_sort : parallel sort SOFTWARE AND SERVICES 11 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. GiGrain Size OpenMP haaapaas similar parameter Part of parallel_ for, not underlying task scheduler • Grain size exists to amortize overhead, not balance load • Units of granularity are loop iterations Typically only need to get it right within an order of magnitude SOFTWARE AND SERVICES 12 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. TiTuning GGirain Size too fine too coarse schdliheduling overhea d didomina tes lose poten tia l paralle lism • Tune byygg examining single-ppprocessor performance • When in doubt, err on the side of making it a little too large, so that performance is not hurt when only one core is available. SOFTWARE AND SERVICES 13 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. The paralle llf_ for TlTemplp a te template <typename Range, typename Body> void parallel_ for(const Range& range, const Body &body); Requires definition of: • A range type to iterat e over – Must define a copy constructor and a destructor – Defines is_empty () – Defines is_divisible () – Defines a splitting constructor, R(R &r, split) •A body type t hat ope rates o n t he ra nge (o r a subra nge ) – Must define a copy constructor and a destructor – DfiDefines operat()tor() SOFTWARE AND SERVICES 14 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. BdBodyy is Generi c Requirements fo r para llel_ foro Body Body::Body(const Body&) Copy constructor Bodyyy()::~Body() Destructor void Body::operator() (Range& subrange) const Apply the body to subrange. paralle l_ for partitions orig ina l range in to subranges, and dea ls out subranggyes to worker threads in a way that: • Balances load • Uses cache efficiently •Scales SOFTWARE AND SERVICES 15 Copyright © 2014, Intel Corporation. All rights reserved. In te l and the Int el logo are trad emark s or regist ered trad emark s of Int el Corporati on or its subsidi ari es in the Unit ed Stat es or other countries. * Other brands and names are the property of their respective owners. Range is GiGeneric Requirements fo r para llel_ foro Raagnge R::R (const R&) Copy constructor R::~R() Destructor bool R::is_empty() const True if range is empty bool R::is_divisible() const True if range can be partitioned R::R (R& r, split) Splitting constructor; splits r into two subranges Library provides predefined ranges • blocked_range and blocked_range2d You can define your own ranges SOFTWARE AND SERVICES 16 Copyright © 2014, Intel Corporation.

Threading for Performance with Intel® Threading Building Blocks

A Fast Concurrent and Resizable Robin Hood Hash Table Endrias

Intel Threading Building Blocks

16 Concurrent Hash Tables: Fast and General(?)!

Locks, Lock Based Data Structures  Review: Proportional Share Scheduler – Ch

Read-Copy-Update for Helenos (Slides)

Concurrent Data Structures

29. Lock-Based Concurrent Data Structures Operating System: Three Easy Pieces

Blocking and Non-Blocking Concurrent Hash Tables in Multi-Core Systems

Phase-Concurrent Hash Tables for Determinism

Fast, Dynamically-Sized Concurrent Hash Table⋆

Concurrent Hash Tables: Fast and General?(!)

Scalable Concurrent Hash Tables Via Relativistic Programming