Best Practice Guide Intel Xeon Phi V1.1

Best Practice Guide Intel Xeon Phi v1.1 Michaela Barth, KTH Sweden Mikko Byckling, CSC Finland Nevena Ilieva, NCSA Bulgaria Sami Saarinen, CSC Finland Michael Schliephake, KTH Sweden Volker Weinberg (Editor), LRZ Germany <[email protected]> 14-02-2014 1 Best Practice Guide Intel Xeon Phi v1.1 Table of Contents 1. Introduction .............................................................................................................................. 4 2. Intel MIC architecture & system overview ..................................................................................... 4 2.1. The Intel MIC architecture ............................................................................................... 4 2.1.1. Intel Xeon Phi coprocessor architecture overview ....................................................... 4 2.1.2. The cache hierarchy .............................................................................................. 6 2.2. Network configuration & system access .............................................................................. 7 3. Native compilation ................................................................................................................... 11 4. Intel compiler's offload pragmas ................................................................................................. 12 4.1. Simple example ............................................................................................................ 12 4.2. Obtaining informations about the offloading ....................................................................... 13 4.3. Syntax of pragmas ........................................................................................................ 14 4.4. Recommendations ......................................................................................................... 15 4.4.1. Explicit worksharing ........................................................................................... 15 4.4.2. Persistent data on the coprocessor .......................................................................... 15 4.4.3. Optimizing offloaded code ................................................................................... 16 4.5. Intel Cilk Plus parallel extensions .................................................................................... 17 5. OpenMP and hybrid ................................................................................................................. 17 5.1. OpenMP ...................................................................................................................... 17 5.1.1. OpenMP basics .................................................................................................. 17 5.1.2. Threading and affinity ......................................................................................... 17 5.1.3. Loop scheduling ................................................................................................. 18 5.1.4. Scalability improvement ....................................................................................... 18 5.2. Hybrid OpenMP/MPI ..................................................................................................... 19 5.2.1. Programming models ........................................................................................... 19 5.2.2. Threading of the MPI ranks .................................................................................. 19 6. MPI ...................................................................................................................................... 19 6.1. Setting up the MPI environment ...................................................................................... 19 6.2. MPI programming models .............................................................................................. 20 6.2.1. Coprocessor-only model ....................................................................................... 20 6.2.2. Symmetric model ................................................................................................ 20 6.2.3. Host-only model ................................................................................................. 21 6.3. Simplifying launching of MPI jobs ................................................................................... 21 7. Intel MKL (Math Kernel Library) ............................................................................................... 21 7.1. MKL usage modes ........................................................................................................ 22 7.1.1. Automatic Offload (AO) ...................................................................................... 22 7.1.2. Compiler Assisted Offload (CAO) ......................................................................... 23 7.1.3. Native Execution ................................................................................................ 23 7.2. Example code ............................................................................................................... 24 7.3. Intel Math Kernel Library Link Line Advisor ..................................................................... 24 8. TBB: Intel Threading Building Blocks ........................................................................................ 24 8.1. Advantages of TBB ....................................................................................................... 24 8.2. Using TBB natively ....................................................................................................... 25 8.3. Offloading TBB ............................................................................................................ 27 8.4. Examples ..................................................................................................................... 27 8.4.1. Exposing parallelism to the processor ..................................................................... 27 8.4.2. Vectorization and Cache-Locality .......................................................................... 31 8.4.3. Work-stealing versus Work-sharing ........................................................................ 31 9. IPP: The Intel Integrated Performance Primitives .......................................................................... 31 9.1. Overview of IPP ........................................................................................................... 31 9.2. Using IPP .................................................................................................................... 32 9.2.1. Getting Started ................................................................................................... 32 9.2.2. Linking of Applications ....................................................................................... 33 9.3. Multithreading .............................................................................................................. 33 9.4. Links and References ..................................................................................................... 33 10. Further programming models ................................................................................................... 33 2 Best Practice Guide Intel Xeon Phi v1.1 10.1. OpenCL ..................................................................................................................... 34 10.2. OpenACC .................................................................................................................. 34 11. Debugging ............................................................................................................................ 34 11.1. Native debugging with gdb ........................................................................................... 34 11.2. Remote debugging with gdb .......................................................................................... 34 12. Tuning ................................................................................................................................. 35 12.1. Single core optimization ............................................................................................... 35 12.1.1. Memory alignment ............................................................................................ 35 12.1.2. SIMD optimization ............................................................................................ 36 12.2. OpenMP optimization .................................................................................................. 37 12.2.1. OpenMP thread affinity ...................................................................................... 37 12.2.2. Example: Thread affinity .................................................................................... 38 12.2.3. OpenMP thread placement .................................................................................. 39 12.2.4. Multiple parallel regions and barriers .................................................................... 40 12.2.5. Example: Multiple parallel regions and barriers ...................................................... 40 12.2.6. False sharing .................................................................................................... 42 12.2.7. Example: False sharing ...................................................................................... 42 12.2.8. Memory limitations ..........................................................................................

Best Practice Guide Intel Xeon Phi V1.1

Driving Performance with Intel® Advisor's Flow Graph Analyzer

Intel® Parallel Studio XE 2013 SP1 Product Brief

Intel® Technology Journal | Volume 18, Issue 2, 2014

Intel Compiler

Intel® Oneapi Programming Guide

Использование Intel® Distribution for Python* INTRODUCTION Python Is #1 Programming Language in Hiring Demand Followed by Java and C++

How Intel® Parallel Studio XE 2015 Helps Make Faster Code Faster for HPC

Intel Tuning for JURECA and JUWELS

Разработка Приложений Для Tizen. Tizen SDK

Intel Xeon-Phi an Overview – MKL

Best Practice Guide Intel Xeon Phi V2.0

Intel-Xeon-Phi-Coprocessor-Quick-Start