Intel® Threading Building Blocks

CERN openlab-Intel MIC / Xeon Phi training
Intel® Xeon Phi™ Product Family
Intel® Threading Building Blocks
Hans Pabst, April 12th 2013
Software and Services Group, Intel Corporation

Agenda
• Motivation and Introduction
• Finding Concurrency
• Generic Algorithms
• Task-based Programming
• Performance Tuning
• Concurrent Containers
• Synchronization Primitives
• Flow Graph

Copyright© 2012, Intel Corporation. All rights reserved. 4/12/2013 *Other brands and names are the property of their respective owners.

Moore’s Law
[Figure: process technology roadmap. 90nm (2003), 65nm (2005), 45nm (2007), 32nm (2009), 22nm (2011), 14nm (2013), 10nm (2015). Annotations: Hi-K metal-gate, 3-D Tri-gate, Shrink. Image labels: 25 nm, 15nm.]

The Problem
Even when we integrate more of the system we still have spare transistors: multicore!
Multi-threading:
• Reduce or hide latency
• Increase throughput
Remember Pollack’s rule (Compute ∝ √Area)
• 4x area gives 2x performance in one core, but 4x performance in 4 cores
• In the same area, multiple wimpy cores provide more compute than one big one
How can we program all these cores?

Programmability and Performance
“Parallel hardware needs parallel programming”
[Figure: programmability and performance over time, across the GHz era, the multicore era, and the manycore era.]

Didn’t we solve this in the 1990s?
Pthreads standard: IEEE 1003.1c-1995
OpenMP standard: 1997
Yes, but…
• How to split up work? How to keep caches hot?
• How to balance load between threads?
• What about nested parallelism (call chain)?
Programming with threads is HARD
• Atomicity, ordering, and/vs. scalability
• Data races, deadlocks, etc.
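To make the atomicity and data-race points concrete, here is a minimal sketch using only the C++ standard library (not TBB, and not taken from the slides). The function name `count_with_atomic` and its parameters are hypothetical. Several threads increment one shared counter; `std::atomic` makes each increment indivisible, whereas a plain `long` in the same loop would be a data race.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: increments a shared counter from several threads.
// std::atomic makes every increment indivisible, so no update is lost.
// Replacing it with a plain `long` would make this loop a data race.
long count_with_atomic(unsigned num_threads, long increments_per_thread) {
    assert(num_threads > 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) {
                // Relaxed ordering suffices here: only the final total matters,
                // and join() below synchronizes with the end of each thread.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return counter.load();
}
```

The sketch is correct but does not scale well: all threads contend for the same counter, which illustrates the "atomicity vs. scalability" tension named on the slide. Accumulating per-thread partial counts and combining them once at the end would reduce that contention.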
A Family of Parallel Programming Models
Developer Choice
• Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism. Open sourced. Also an Intel product.
• Intel® Threading Building Blocks: Widely used C++ template library for parallelism. Open sourced. Also an Intel product.
• Domain-Specific Libraries: Intel® Integrated Performance Primitives, Intel® Math Kernel Library
• Established Standards: Message Passing Interface (MPI), OpenMP*, Coarray Fortran, OpenCL*
• Research and Development: Intel® Concurrent Collections, Offload Extensions, Intel® Array Building Blocks, Intel® SPMD Parallel Compiler
Choice of high-performance parallel programming models
- Libraries for pre-optimized and parallelized functionality
- Intel® Cilk™ Plus and Intel® Threading Building Blocks support composable parallelization of a wide variety of applications.
- OpenCL* addresses the needs of customers in specific segments, and provides developers an additional choice to maximize their app performance
- MPI supports distributed computation, and combines with other models on nodes

Intel® Parallel Studio XE 2013
Phase: Advanced Parallel Design
  Productivity Tool: Intel® Advisor XE
  Feature: Analyze existing code base and find opportunities for parallelization.
  Benefit: Easier analysis and performance heuristics, find compute hotspots and check for parallelization strategies.
Phase: Advanced Build & Debug
  Productivity Tool: Intel® Composer XE
  Feature: C/C++ and Fortran compilers, performance libraries, and parallel models
  Benefit: Application performance, scalability and quality for current multicore and future many-core systems.
Phase: Advanced Verify
  Productivity Tool: Intel® Inspector XE
  Feature: Memory & threading error checking tool for higher code reliability & quality
  Benefit: Increases productivity and lowers cost, by catching memory and threading defects early
Phase: Advanced Tune
  Productivity Tool: Intel® VTune™ Amplifier XE
  Feature: Performance Profiler to optimize performance and scalability
  Benefit: Removes guesswork, saves time, makes it easier to find performance and scalability bottlenecks. Combines ease of use with deeper insights.

Intel® Threading Building Blocks
C++ Library for parallel programming
• Takes care of managing multitasking
Runtime library
• Scalability to available number of threads
Cross-platform
• Windows, Linux, Mac OS* and others
http://threadingbuildingblocks.org/

License, and more…
GPL with “runtime exception”
You can use it commercially without the need to disclose your source code.
• Drives C++ compiler features
• Regular updates, and maintenance
• Community preview features*
• Commercially aligned releases
• Premier support
* You are always invited to provide feedback and to share your issues, and in this way help others.
Intel® Threading Building Blocks
• Generic Parallel Algorithms: Efficient scalable way to exploit the power of multi-core without having to start from scratch
• Concurrent Containers: Concurrent access, and a scalable alternative to containers that are externally locked for thread-safety
• TBB Flow Graph
• Task scheduler: The engine that empowers parallel algorithms that employs task-stealing to maximize concurrency
• Thread Local Storage: Scalable implementation of thread-local data that supports infinite number of TLS
• Synchronization Primitives: User-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
• Miscellaneous: Thread-safe timers
• Threads: OS API wrappers
• Memory Allocation: Per-thread scalable memory manager and false-sharing free allocators

Finding Concurrency
• Design patterns
  – An encoded expertise to capture that “quality without a name” that distinguishes truly excellent designs
  – A small number of patterns can support a wide range of applications
• Parallel pattern: commonly occurring combination of task distribution and data access
Identify and use parallel patterns! Examples: reduction, or pipeline
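As an illustration of the reduction pattern named above, the following is a minimal sketch in standard C++ (it deliberately avoids TBB so that it stays self-contained; TBB offers `tbb::parallel_reduce` for this pattern). The function name `parallel_sum` and the `num_chunks` parameter are assumptions made for this example. The task distribution is "one chunk per task", and the data access is "each task reads only its own chunk".

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Reduction pattern: split the input into chunks, reduce every chunk
// independently, then combine the partial results.
// `parallel_sum` and `num_chunks` are hypothetical names for this sketch.
double parallel_sum(const std::vector<double>& data, std::size_t num_chunks) {
    assert(num_chunks > 0);
    const std::size_t n = data.size();
    std::vector<std::future<double>> partials;
    partials.reserve(num_chunks);
    for (std::size_t c = 0; c < num_chunks; ++c) {
        // Chunk boundaries cover [0, n) without gaps or overlap.
        const auto first = static_cast<std::ptrdiff_t>(n * c / num_chunks);
        const auto last = static_cast<std::ptrdiff_t>(n * (c + 1) / num_chunks);
        partials.push_back(std::async(std::launch::async, [&data, first, last] {
            return std::accumulate(data.begin() + first, data.begin() + last, 0.0);
        }));
    }
    // Combine step: runs serially on the calling thread.
    double total = 0.0;
    for (std::future<double>& p : partials) {
        total += p.get();
    }
    return total;
}
```

Because floating-point addition is not associative, the chunked result can differ in the last bits from a strictly left-to-right sum; this is a general property of parallel reductions rather than of this sketch.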
Parallel Patterns
• Stencil
• Reduce
• Pack & Expand
• Map
• Scan and Recurrence
• Search & Match
• Superscalar sequence
• Speculative selection
• Partition
• Nest
• Gather/scatter
• Pipeline
* http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/tbbxe/Design_Patterns.pdf
http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/guyb/papers/Ble90.pdf

Reference
• Details a pattern language for parallel algorithm design
• Examples in MPI, OpenMP and Java are given
• Represents the authors' hypothesis for how programmers think about parallel programming
Patterns for Parallel Programming, Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill, Addison-Wesley, 2005, ISBN 0321228111

Finding Concurrency
From serial code*
• Use a profiler, such as Intel® VTune™ Amplifier XE
• Identify hotspots in an application
• Examine the code in hotspots
• Determine whether the tasks within the hotspots can be executed independently
From a design document
• Examine the design components, services, etc.
• Find components that contain independent operations
* Note that starting from serial code may never lead to the best known parallel algorithm.

Intel® Advisor XE
• Tool for what-if analysis
  – Modeling: use code annotations to introduce parallelism
  – Evaluation: estimate the effect, e.g. the speedup
  – GUI-driven assistant (5 steps)
• Productivity and Safety
  – Parallel correctness is checked based on a correct program
  – Non-intrusive API
• It’s not auto-parallelization
• It’s not modifying the code

Example: Quicksort with Intel Advisor

template<typename I>
void serial_qsort(I begin, I end)
{
  typedef typename std::iterator_traits<I>::value_type T;
  if (begin != end) {
    const I pivot = end - 1;
    const I middle = std::partition(begin, pivot,
      std::bind2nd(std::less<T>(), *pivot));
    std::swap(*pivot, *middle);

    ANNOTATE_SITE_BEGIN(Parallel Region);
    ANNOTATE_TASK_BEGIN(Left Partition);
    serial_qsort(begin, middle);
    ANNOTATE_TASK_END(Left Partition);
    ANNOTATE_TASK_BEGIN(Right Partition);
    serial_qsort(middle
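
The annotated quicksort above marks the two recursive calls as candidate tasks. The following is a separate, minimal sketch of how such tasks might be executed concurrently once the modeling looks promising. It is not the continuation of the slide's code and does not use the Advisor annotations or TBB: it relies only on the C++ standard library, and the name `parallel_qsort`, the `depth` parameter, and the choice of `std::async` are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <iterator>
#include <vector>  // only needed by callers that sort a std::vector

// Hypothetical task-parallel quicksort (not the slide's code).
// `depth` limits how many recursion levels start a new thread; this is
// an assumption of the sketch, since a task scheduler would normally
// balance the work instead of a fixed cut-off.
template <typename I>
void parallel_qsort(I begin, I end, int depth = 3) {
    assert(depth >= 0);
    if (begin == end) return;
    typedef typename std::iterator_traits<I>::value_type T;
    const I pivot = std::prev(end);
    // Move every element smaller than the pivot value to the front.
    const I middle = std::partition(begin, pivot,
        [pivot](const T& x) { return x < *pivot; });
    std::iter_swap(pivot, middle);  // pivot value is now in its final place
    if (depth > 0) {
        // The left partition runs as its own task; the right partition
        // runs on the current thread. The two ranges are disjoint.
        std::future<void> left = std::async(std::launch::async,
            [begin, middle, depth] { parallel_qsort(begin, middle, depth - 1); });
        parallel_qsort(std::next(middle), end, depth - 1);
        left.get();  // wait for the left task before returning
    } else {
        parallel_qsort(begin, middle, 0);
        parallel_qsort(std::next(middle), end, 0);
    }
}
```

For example, calling `parallel_qsort(v.begin(), v.end())` on a `std::vector<int>` sorts it in place. The lambda replaces `std::bind2nd`, which was removed from the standard library in C++17.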
