CS 698L: Intel Threading Building Blocks Swarnendu Biswas

CS 698L: Intel Threading Building Blocks Swarnendu Biswas Semester 2019-2020-I CSE, IIT Kanpur Content influenced by many excellent references, see References slide for acknowledgements. Approaches to Parallelism • New languages • For example, Cilk, X10, Chapel • New concepts, but difficult to get widespread acceptance • Language extensions/pragmas • For example, OpenMP • Easy to extend, but requires special compiler or preprocessor support • Library • For example, C++ STL, Intel TBB, and MPI • Works with existing environments, usually no new compiler is needed CS 698L Swarnendu Biswas What is Intel TBB? • A library to help leverage multicore performance using standard C++ • Does not require programmers to be an expert • Writing a correct and scalable parallel loop is not straightforward • Does not require support for new languages and compilers • Does not directly support vectorization • TBB was first available in 2006 • Current release is 2019 Update 8 • Open source and licensed versions available CS 698L Swarnendu Biswas What is Intel TBB? • TBB works at the abstraction of tasks instead of low-level threads • Specify tasks that can run concurrently instead of threads • Specify work (i.e., tasks), instead of focusing on workers (i.e., threads) • Raw threads are like assembly language of parallel programming • Maps tasks onto physical threads, efficiently using cache and balancing load • Full support for nested parallelism CS 698L Swarnendu Biswas Advantages with Intel TBB • Promotes scalable data-parallel programming • Data parallelism is more scalable than functional parallelism • Functional blocks are usually limited while data parallelism scales with more processors • Not tailored for I/O-bound or real-time processing • Compatible with other threading packages and is portable • Can be used in concert with native threads and OpenMP • Relies on generic programming (e.g., C++ STL) CS 698L Swarnendu Biswas Key Features of Intel TBB Generic Parallel algorithms parallel_for, parallel_for_each, Concurrent containers parallel_reduce, parallel_scan, concurrent_hash_map parallel_do, pipeline, parallel_pipeline, concurrent_unordered_map parallel_sort, parallel_invoke concurrent_queue concurrent_bounded_queue Task scheduler concurrent_vector task_group, structured_task_group, task, task_scheduler_init Synchronization primitives Utilities atomic operations, condition_variable tick_count various flavors of mutexes tbb_thread Memory allocators tbb_allocator, cache_aligned_allocator, scalable_allocator, zero_allocator CS 698L Swarnendu Biswas Task-Based Programming • Challenges with threads: oversubscription or undersubscription, scheduling policy, load imbalance, portability • For example, mapping of logical to physical threads is crucial • Mapping also depends on whether computation waits on external devices • Non-trivial impact of time slicing with context switches, cache cooling effects, and lock preemption • Time slicing allows more logical threads than physical threads CS 698L Swarnendu Biswas Task-Based Programming with Intel TBB • Intel TBB parallel algorithms map tasks onto threads automatically • Task scheduler manages the thread pool • Oversubscription and undersubscription of core resources is prevented by task-stealing technique of TBB scheduler • Tasks are lighter-weight than threads CS 698L Swarnendu Biswas An Example: Hello World #include <iostream> int main() { #include <tbb/tbb.h> task_group tg; using namespace std; tg.run(HelloWorld("1")); using namespace tbb; tg.run(HelloWorld("2")); class HelloWorld { const char* id; tg.wait(); public: return EXIT_SUCCESS; HelloWorld(const char* s) : id(s) {} } void operator()() const { cout << "Hello from task " << id << "\n"; } }; CS 698L Swarnendu Biswas An Example: Hello World #include <iostream> int main() { #include <tbb/tbb.h> task_group tg; using namespace std; tg.run(HelloWorld("1")); using namespace tbb; tg.run(HelloWorld("2")); class HelloWorld { const char* id; tg.wait(); public: return EXIT_SUCCESS; HelloWorld(const char* s) : id(s) {} } void operator()() const { cout << "Hello from task " << id << "\n"; } }; CS 698L Swarnendu Biswas Another Example: Parallel loop #include <chrono> void parallel_incr(float* a) { #include <iostream> tbb::parallel_for(static_cast<size_t>(0), #include <tbb/parallel_for.h> static_cast<size_t>(N), #include <tbb/tbb.h> [&](size_t i) { a[i] += 10; using namespace std; }); using namespace std::chrono; using HRTimer = high_resolution_clock::time_point; } #define N (1 << 26) void seq_incr(float* a) { for (int i = 0; i < N; i++) { a[i] += 10; } } CS 698L Swarnendu Biswas Another Example: Parallel loop int main() { start = high_resolution_clock::now(); float* a = new float[N]; parallel_incr(a); for (int i = 0; i < N; i++) { end = high_resolution_clock::now(); a[i] = static_cast<float>(i); duration = duration_cast<microseconds> (end - start).count(); } cout << "Intel TBB Parallel increment in " << duration << " us\n"; HRTimer start = high_resolution_clock: :now(); return EXIT_SUCCESS; seq_incr(a); } HRTimer end = high_resolution_clock::n ow(); auto duration = duration_cast<microsec onds>(end - start).count(); cout << "Sequential increment in " << duration << " us\n"; CS 698L Swarnendu Biswas Another Example: Parallel loop int main() { start = high_resolution_clock::now(); float* a = new float[N]; parallel_incr(a); for (int i = 0; i < N; i++) { end = high_resolution_clock::now(); a[i] = static_cast<float>(i); duration = duration_cast<microseconds> (end - start).count(); } cout << "Intel TBB Parallel increment in " << duration << " us\n"; HRTimer start = high_resolution_clock: :now(); return EXIT_SUCCESS; seq_incr(a); } HRTimer end = high_resolution_clock::n ow(); auto duration = duration_cast<microsec onds>(end - start).count(); cout << "Sequential increment in " << duration << " us\n"; CS 698L Swarnendu Biswas Initializing the TBB Library #include <tbb/task_scheduler_init.h> • Control when the task scheduler is constructed using namespace tbb; and destroyed. • Specify the number of int main( ) { threads used by the task task_scheduler_init init; scheduler. ... • Specify the stack size for return 0; worker threads } Not required in recent versions, >= TBB 2.2 CS 698L Swarnendu Biswas Thinking Parallel • Decomposition • Decompose the problem into concurrent tasks • Scaling • Identify concurrent tasks to keep processors busy • Threads • Map tasks to threads • Correctness • Ensure correct synchronization to shared resources CS 698L Swarnendu Biswas How to Decompose? Data parallelism Task parallelism CS 698L Swarnendu Biswas How to Decompose? • Distinguishing just between data and task parallelism may not be perfect • Imagine TAs grading questions of varied difficulty • Might need hybrid parallelism or pipelining or work stealing CS 698L Swarnendu Biswas OpenMP vs Intel TBB OpenMP Intel TBB • Language extension consisting of • Library for task-based pragmas, routines, and programming environment variables • Supports C, C++, and Fortran • Supports C++ with generics • User can control scheduling • Automated divide-and-conquer policies approach to scheduling, with work stealing • OpenMP limited to specified • Generic programming is flexible types (for e.g., reduction) with types CS 698L Swarnendu Biswas Generic Parallel Algorithms CS 698L Swarnendu Biswas Generic Programming • Best known example is C++ Standard Template Library (STL) • Enables distribution of useful high-quality algorithms and data structures • Write best possible algorithm with fewest constraints (for e.g., std::sort) • Instantiate algorithm to specific situation • C++ template instantiation, partial specialization, and inlining make resulting code efficient • STL is not generally thread-safe CS 698L Swarnendu Biswas Generic Programming Example • The compiler creates the needed versions T must define a copy constructor and a destructor template <typename T> T max (T x, T y) { if (x < y) return y; return x; } T must define operator < int main() { int i = max(20,5); double f = max(2.5, 5.2); MyClass m = max(MyClass(“foo”), MyClass(“bar”)); return 0; } CS 698L Swarnendu Biswas Intel Threading Building Blocks Patterns • High-level parallel and scalable patterns • parallel_for: load-balanced parallel execution of independent loop iterations • parallel_reduce: load-balanced parallel execution of independent loop iterations that perform reduction • parallel_scan: template function that computes parallel prefix • parallel_while: load-balanced parallel execution of independent loop iterations with unknown or dynamically changing bounds • pipeline: data-flow pipeline pattern • parallel_sort: parallel sort CS 698L Swarnendu Biswas Loop Parallelization • parallel_for and parallel_reduce • Load-balanced, parallel execution of a fixed number of independent loop iterations • parallel_scan • A template function that computes a prefix computation (also known as a scan) in parallel • y[i] = y[i-1] op x[i] CS 698L Swarnendu Biswas TBB parallel_for void SerialApplyFoo(float a[], size_t n) { for (size_t i=0; i<n; ++i) foo(a[i]); } CS 698L Swarnendu Biswas Class Definition for TBB parallel_for #include “tbb/blocked_range.h” class ApplyFoo { float *const m_a; Task public: void operator()(const blocked_range<size_t>& r) const { Body object float *a = m_a; for (size_t i=r.begin(); i!=r.end( ); ++i) foo(a[i]); } ApplyFoo(float a[]) : m_a(a) {} }; CS 698L Swarnendu Biswas TBB parallel_for #include “tbb/parallel_for.h” void ParallelApplyFoo(float a[], size_t n) { parallel_for(blocked_range<size_t>(0,n,grainSize),

CS 698L: Intel Threading Building Blocks Swarnendu Biswas

Heterogeneous Task Scheduling for Accelerated Openmp

Parallel Programming

Openmp API 5.1 Specification

Openmp Made Easy with INTEL® ADVISOR

Introduction to Openmp Paul Edmon ITC Research Computing Associate

Parallel Programming with Openmp

A Fast Concurrent and Resizable Robin Hood Hash Table Endrias

Intel Threading Building Blocks

Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation

Lecture 12: Introduction to Openmp (Part 1)

Programming Your GPU with Openmp*

16 Concurrent Hash Tables: Fast and General(?)!