Best Practice Guide Intel Xeon Phi v2.0

Emanouil Atanassov, IICT-BAS, Bulgaria
Michaela Barth, KTH, Sweden
Mikko Byckling, CSC, Finland
Vali Codreanu, SURFsara, Netherlands
Nevena Ilieva, NCSA, Bulgaria
Tomas Karasek, IT4Innovations, Czech Republic
Jorge Rodriguez, BSC, Spain
Sami Saarinen, CSC, Finland
Ole Widar Saastad, University of Oslo, Norway
Michael Schliephake, KTH, Sweden
Martin Stachon, IT4Innovations, Czech Republic
Janko Strassburg, BSC, Spain
Volker Weinberg (Editor), LRZ, Germany

31-01-2017

Table of Contents

1. Introduction
2. Intel MIC architecture & system overview
   2.1. The Intel MIC architecture
      2.1.1. Intel Xeon Phi architecture overview
      2.1.2. The cache hierarchy
   2.2. Network configuration & system access
3. Programming Models
4. Native compilation
5. Offloading
   5.1. Introduction
   5.2. Simple C example
   5.3. Simple Fortran-90 example
      5.3.1. User function offload
      5.3.2. Offloading regions in program
      5.3.3. Offloading functions or subroutines
      5.3.4. Load distribution and balancing, host cpus, co-processors (single and multiple)
   5.4. Obtaining information about the offloading
   5.5. Syntax of pragmas
   5.6. Recommendations
      5.6.1. Explicit worksharing
      5.6.2. Persistent data on the coprocessor
      5.6.3. Optimizing offloaded code
   5.7. Intel Cilk Plus parallel extensions
6. OpenMP, hybrid and OpenMP 4.x Offloading
   6.1. OpenMP
      6.1.1. OpenMP basics
      6.1.2. Threading and affinity
      6.1.3. Loop scheduling
      6.1.4. Scalability improvement
   6.2. Hybrid OpenMP/MPI
      6.2.1. Programming models
      6.2.2. Threading of the MPI ranks
   6.3. OpenMP 4.x Offloading
      6.3.1. Execution Model
      6.3.2. Overview of the most important device constructs
      6.3.3. The target construct
      6.3.4. The teams construct
      6.3.5. The distribute construct
      6.3.6. Composite constructs and shortcuts in OpenMP 4.5
      6.3.7. Examples
      6.3.8. Runtime routines and environment variables
      6.3.9. Best Practices
7. MPI
   7.1. Setting up the MPI environment
   7.2. MPI programming models
      7.2.1. Coprocessor-only model
      7.2.2. Symmetric model
      7.2.3. Host-only model
   7.3. Simplifying launching of MPI jobs
8. Intel MKL (Math Kernel Library)
   8.1. Introduction
   8.2. MKL usage modes
      8.2.1. Automatic Offload (AO)
      8.2.2. Compiler Assisted Offload (CAO)
      8.2.3. Native Execution


   8.3. Example code
   8.4. Intel Math Kernel Library Link Line Advisor
9. TBB: Intel Threading Building Blocks
   9.1. Advantages of TBB
   9.2. Using TBB natively
   9.3. Offloading TBB
   9.4. Examples
      9.4.1. Exposing parallelism to the processor
      9.4.2. Vectorization and Cache-Locality
      9.4.3. Work-stealing versus Work-sharing
10. IPP: The Intel Integrated Performance Primitives
   10.1. Overview of IPP
   10.2. Using IPP
      10.2.1. Getting Started
      10.2.2. Linking of Applications
   10.3. Multithreading
11. Further programming models
   11.1. OpenCL
   11.2. OpenACC
12. Debugging
   12.1. Native debugging with gdb
   12.2. Remote debugging with gdb
   12.3. Debugging with Totalview
      12.3.1. Using a Standard Single Server Launch string
      12.3.2. Using the Xeon Phi Native Launch string
13. Tuning
   13.1. Single core optimization
      13.1.1. Memory alignment
      13.1.2. SIMD optimization
   13.2. OpenMP optimization
      13.2.1. OpenMP thread affinity
      13.2.2. Example: Thread affinity
      13.2.3. OpenMP thread placement
      13.2.4. Multiple parallel regions and barriers
      13.2.5. Example: Multiple parallel regions and barriers
      13.2.6. False sharing
      13.2.7. Example: False sharing
      13.2.8. Memory limitations
      13.2.9. Example: Memory limitations
      13.2.10. Nested parallelism
      13.2.11. Example: Nested parallelism
      13.2.12. OpenMP load balancing
14. Performance analysis tools
   14.1. Intel performance analysis tools
   14.2. Scalasca
      14.2.1. Compilation of Scalasca
      14.2.2. Usage
15. Benchmarks
   15.1. Performance evaluation of native execution
      15.1.1. Stream memory bandwidth
      15.1.2. HYDRO benchmark
   15.2. Offloading Performance
      15.2.1. Offloading with MKL
      15.2.2. Offloading user functions
      15.2.3. Accelerator Benchmark Suite - Synthetic Benchmarks
16. Intel Xeon Phi Application Enabling
   16.1. Acceleration of Blender Cycles Render engine by Intel Xeon Phi
   16.2. The BM3D Filter on Intel Xeon Phi


      16.2.1. BM3D filtering method
      16.2.2. BM3D - computationally extensive parts
      16.2.3. Parallel implementation of BM3D filter
      16.2.4. Results
   16.3. GEANT4 and GEANTV on Intel Xeon Phi Systems
      16.3.1. Introduction
      16.3.2. GEANT4 pre-requisites
      16.3.3. GEANT4 configuration and installation
      16.3.4. GEANT4 cross compilation as a native Xeon Phi application
      16.3.5. GEANTV installation
      16.3.6. GEANT4 user code compilation and execution on CPU
      16.3.7. GEANT4 user code compilation and execution on XeonPhi in native mode
17. European Intel Xeon Phi based systems
   17.1. Avitohol @ BAS (NCSA)
      17.1.1. Introduction
      17.1.2. System Architecture / Configuration
      17.1.3. System Access
      17.1.4. Production Environment
      17.1.5. Programming Environment
      17.1.6. Programming Tools
      17.1.7. Final remarks
   17.2. MareNostrum @ BSC
      17.2.1. System Architecture / Configuration
      17.2.2. System Access
      17.2.3. Production Environment
      17.2.4. Programming Environment
      17.2.5. Tuning
      17.2.6. Debugging
   17.3. Salomon @ IT4Innovations
      17.3.1. System Architecture / Configuration
      17.3.2. System Access
      17.3.3. Production Environment
      17.3.4. Programming Environment
   17.4. SuperMIC @ LRZ
      17.4.1. System Architecture / Configuration
      17.4.2. System Access
      17.4.3. Production Environment
      17.4.4. Programming Environment
      17.4.5. Performance Analysis
      17.4.6. Tuning
      17.4.7. Debugging
Further documentation

1. Introduction

Figure 1. Intel Xeon Phi coprocessor

This Best Practice Guide provides information about Intel's Many Integrated Core (MIC) architecture and programming models for the first generation Intel® Xeon Phi™ coprocessor, named Knights Corner (KNC), in order to enable programmers to achieve good performance with their applications.

The guide covers a wide range of topics, from a description of the hardware of the Intel® Xeon Phi™ coprocessor, through information about the basic programming models and about porting programs, up to tools and strategies for analysing and improving the performance of applications. Thanks to its highly parallel architecture and the use of high-bandwidth memory, the MIC architecture allows higher performance than traditional CPUs for many types of scientific applications. The guide is based on the PRACE-3IP Intel® Xeon Phi™ Best Practice Guide. New is the inclusion of information about applications, benchmarks and European Intel® Xeon Phi™ based systems. The following systems are now described:

• Avitohol @ IICT-BAS, Bulgaria

• MareNostrum @ BSC, Spain

• Salomon @ IT4Innovations, Czech Republic

• SuperMIC @ LRZ, Germany

Several sections were added and updated. The Best Practice Guide also includes information from publicly available Intel documents and Intel webinars [14].

Information about the second generation Intel Xeon Phi coprocessor named Knights Landing (KNL) can be found in a separate guide named "Best Practice Guide - Knights Landing" available under http://www.prace-ri.eu/best-practice-guides/.

In 2013 the first books about programming the Intel Xeon Phi coprocessor [1] [3] [4] were published. The Intel book about Xeon Phi programming [1] was updated in 2016 to cover the new Knights Landing architecture [2]. We also recommend the two volumes of "High Performance Parallelism Pearls" by James Reinders and Jim Jeffers, which include contributions from many authors about porting experience [5] [6]. We further recommend a book about structured parallel programming [7]. Useful online documentation about the Intel Xeon Phi coprocessor can be found in Intel's developer zone for Xeon Phi Programming [9] and the Intel Many Integrated Core Architecture


User Forum [10]. To get going quickly, have a look at the Intel Xeon Phi Coprocessor Developer's Quick Start Guide [18] and also at the paper [27].

Various experiences with application enabling for Intel Xeon Phi gained within PRACE on the EURORA cluster at CINECA (Italy) in late 2013 can be found in whitepapers available online at [19].

2. Intel MIC architecture & system overview

2.1. The Intel MIC architecture

2.1.1. Intel Xeon Phi coprocessor architecture overview

The Intel Xeon Phi coprocessor consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The coprocessor runs a Linux operating system and supports all important Intel development tools, like the C/C++ and Fortran compilers, MPI and OpenMP, high performance libraries like MKL, debuggers and tracing tools like Intel VTune Amplifier XE. Traditional UNIX tools on the coprocessor are supported via BusyBox, which combines tiny versions of many common UNIX utilities into a single small executable. The coprocessor is connected to an Intel Xeon processor - the "host" - via the PCI Express (PCIe) bus. The implementation of a virtualized TCP/IP stack allows the coprocessor to be accessed like a network node.

Figure 2. Intel Xeon Phi chip architecture layout. Notice the ring bus as interconnect.

Summarized information about the hardware architecture can be found in [20]. In the following we cite the most important properties of the MIC architecture from the System Software Developers Guide [21], which includes many details about the MIC architecture:

Core • the processor core (scalar unit) is an in-order architecture (based on the Intel Pentium processor family)

• fetches and decodes instructions from four hardware threads

• supports a 64-bit execution environment, along with Intel Initial Many Core Instructions

• does not support any previous Intel SIMD extensions like MMX, SSE, SSE2, SSE3, SSE4.1, SSE4.2 or AVX instructions


• new vector instructions provided by the Intel Xeon Phi coprocessor instruction set utilize a dedicated 512-bit wide vector floating-point unit (VPU) that is provided for each core

• high performance support for reciprocal, square root, power and exponent operations, scatter/gather and streaming store capabilities to achieve higher effective memory bandwidth

• can execute 2 instructions per cycle, one on the U-pipe and one on the V-pipe (not all instruction types can be executed by the V-pipe, e.g. vector instructions can only be executed on the U-pipe)

• contains the L1 Icache and Dcache

• each core is connected to a ring interconnect via the Core Ring Interface (CRI) as shown in Figure 2.

Vector Processing Unit (VPU) • the VPU includes the EMU (Extended Math Unit) and executes 16 single-precision floating point, 16 32-bit integer or 8 double-precision floating point operations per cycle. Each operation can be a fused multiply-add, giving 32 single-precision or 16 double-precision floating-point operations per cycle.

• contains the vector register file: 32 512-bit wide registers per thread context, each register can hold 16 singles or 8 doubles

• most vector instructions have a 4-clock latency with a 1 clock throughput

Core Ring Interface (CRI) • hosts the L2 cache and the tag directory (TD)

• connects each core to an Intel Xeon Phi coprocessor Ring Stop (RS), which connects to the interprocessor core network

Ring • includes component interfaces, ring stops, ring turns, addressing and flow control

• a Xeon Phi coprocessor has 2 of these rings, one travelling in each direction

SBOX • Gen2 PCI Express client logic

• system interface to the host CPU or PCI Express switch

• DMA engine

GBOX • coprocessor memory controller

• consists of the FBOX (interface to the ring interconnect), the MBOX (request scheduler) and the PBOX (physical layer that interfaces with the GDDR devices)

• There are 8 memory controllers supporting up to 16 GDDR5 channels. With a transfer speed of up to 5.5 GT/s, a theoretical aggregated bandwidth of 352 GB/s is provided.

Performance Monitoring Unit (PMU) • allows data to be collected from all units in the architecture

• does not implement some advanced features found in mainline IA cores (e.g. precise event-based sampling, etc.)

The following picture (from [21]) illustrates the building blocks of the architecture.


Figure 3. Intel MIC architecture overview

Multithreading is a key element to hide latency and is used extensively in the MIC architecture. There are also additional reasons for using several threads in flight. The vector unit can only issue a vector instruction from one instruction stream every second clock cycle. This constraint requires that at least two threads per core are scheduled at the same time to fill the vector pipeline. The vector unit can execute one instruction per clock cycle if fed from two different threads. Issuing more threads helps both to hide latency and to fill the vector pipeline. The four available hardware threads help to accommodate this.

2.1.2. The cache hierarchy

Details about the L1 and L2 cache can be found in the System Software Developers Guide [21]. We only cite the most important features here.

Each core has a 32 KB L1 instruction cache and a 32 KB L1 data cache. Associativity is 8-way, with a cache line size of 64 bytes. The L1 cache has a load-to-use latency of 1 cycle, which means that an integer value loaded from the L1 cache can be used in the next clock cycle by an integer instruction. (Vector instructions have different latencies than integer instructions.)

The L2 cache is a unified cache which is inclusive of the L1 data and instruction caches. Each core contributes 512 KB of L2 to the total global shared L2 cache storage. If no cores share any data or code, then the effective total L2 size of the chip is up to 31 MB. On the other hand, if every core shares exactly the same code and data in perfect synchronization, then the effective total L2 size of the chip is only 512 KB. The actual size of the workload-perceived L2 storage is a function of the degree of code and data sharing among cores and threads.

As for the L1 cache, associativity is 8-way, with a cache line size of 64 bytes. The raw latency is 11 clock cycles. The L2 cache has a streaming hardware prefetcher and supports ECC correction.

The main properties of the L1 and L2 caches are summarized in the following table (from [21]):

Parameter       L1              L2
Coherence       MESI            MESI
Size            32 KB + 32 KB   512 KB
Associativity   8-way           8-way
Line Size       64 bytes        64 bytes
Banks           8               8
Access Time     1 cycle         11 cycles
Policy          pseudo LRU      pseudo LRU
Duty Cycle      1 per clock     1 per clock
Ports           Read or Write   Read or Write

2.2. Network configuration & system access

Details about the system startup and the network configuration can be found in [22] and in the documentation coming with the Intel Manycore Platform Software Stack (Intel MPSS) [13].

To start the Intel MPSS stack and initialize the Xeon Phi coprocessor the following command has to be executed as root or during host system start-up:

weinberg@knf1:~> sudo service mpss start

During start-up details are logged to /var/log/messages.

If MPSS with OFED support is needed, the following additional commands have to be executed as root:

weinberg@knf1:~> sudo service openibd start
weinberg@knf1:~> sudo service opensmd start
weinberg@knf1:~> sudo service ofed-mic start

In a production environment it is recommended to reboot the MIC nodes in the epilog script of the Batch system. This can be achieved by:

# At this point, we reboot both cards to have them cleanly prepared
# for new scheduled jobs.

# reset both cards
micctrl --reset --wait

# boot both cards
micctrl --boot --wait

# restart the ofed-mic service
service ofed-mic start


# cleanup scratch
rm -rf /scratch

The IP addresses of the attached coprocessors can be listed via the traditional ifconfig Linux program.

lu65fok@i01r13c01:~> ifconfig mic0
mic0      Link encap:Ethernet  HWaddr 4C:79:BA:44:04:E3
          inet6 addr: fe80::4e79:baff:fe44:4e3/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:13137 errors:0 dropped:0 overruns:0 frame:0
          TX packets:717591 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1020768 (996.8 Kb)  TX bytes:143148709 (136.5 Mb)

lu65fok@i01r13c01:~> ifconfig mic1
mic1      Link encap:Ethernet  HWaddr 4C:79:BA:42:04:77
          inet6 addr: fe80::4e79:baff:fe42:477/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:5785 errors:0 dropped:0 overruns:0 frame:0
          TX packets:704084 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:498952 (487.2 Kb)  TX bytes:124219803 (118.4 Mb)

lu65fok@i01r13c01:~>

Further information can be obtained by running the micinfo program on the host. To also get PCIe-related details, the command has to be run with root privileges. Here is example output from a host with two Intel Xeon Phi 5110P/5120D-class coprocessors (B1 stepping):

lu65fok@i01r13c0:~> sudo micinfo
MicInfo Utility Log
Created Sat Jan 28 13:53:36 2017

System Info
        HOST OS                 : Linux
        OS Version              : 3.0.76-0.11-default
        Driver Version          : 3.4-1
        MPSS Version            : 3.4
        Host Physical Memory    : 66078 MB

Device No: 0, Device Name: mic0

Version
        Flash Version           : 2.1.02.0390
        SMC Firmware Version    : 1.16.5078
        SMC Boot Loader Version : 1.8.4326
        uOS Version             : 2.6.38.8+mpss3.4
        Device Serial Number    : ADKC33400625

Board
        Vendor ID               : 0x8086
        Device ID               : 0x2250
        Subsystem ID            : 0x2500
        Coprocessor Stepping ID : 3


        PCIe Width              : x16
        PCIe Speed              : 5 GT/s
        PCIe Max payload size   : 256 bytes
        PCIe Max read req size  : 4096 bytes
        Coprocessor Model       : 0x01
        Coprocessor Model Ext   : 0x00
        Coprocessor Type        : 0x00
        Coprocessor Family      : 0x0b
        Coprocessor Family Ext  : 0x00
        Coprocessor Stepping    : B1
        Board SKU               : B1PRQ-5110P/5120D
        ECC Mode                : Enabled
        SMC HW Revision         : Product 225W Passive CS

Cores
        Total No of Active Cores : 60
        Voltage                 : 1026000 uV
        Frequency               : 1052631 kHz

Thermal
        Fan Speed Control       : N/A
        Fan RPM                 : N/A
        Fan PWM                 : N/A
        Die Temp                : 58 C

GDDR
        GDDR Vendor             : Elpida
        GDDR Version            : 0x1
        GDDR Density            : 2048 Mb
        GDDR Size               : 7936 MB
        GDDR Technology         : GDDR5
        GDDR Speed              : 5.000000 GT/s
        GDDR Frequency          : 2500000 kHz
        GDDR Voltage            : 1501000 uV

Device No: 1, Device Name: mic1

Version
        Flash Version           : 2.1.02.0390
        SMC Firmware Version    : 1.16.5078
        SMC Boot Loader Version : 1.8.4326
        uOS Version             : 2.6.38.8+mpss3.4
        Device Serial Number    : ADKC33300571

Board
        Vendor ID               : 0x8086
        Device ID               : 0x2250
        Subsystem ID            : 0x2500
        Coprocessor Stepping ID : 3
        PCIe Width              : x16
        PCIe Speed              : 5 GT/s
        PCIe Max payload size   : 256 bytes
        PCIe Max read req size  : 4096 bytes
        Coprocessor Model       : 0x01
        Coprocessor Model Ext   : 0x00
        Coprocessor Type        : 0x00
        Coprocessor Family      : 0x0b
        Coprocessor Family Ext  : 0x00


        Coprocessor Stepping    : B1
        Board SKU               : B1PRQ-5110P/5120D
        ECC Mode                : Enabled
        SMC HW Revision         : Product 225W Passive CS

Cores
        Total No of Active Cores : 60
        Voltage                 : 973000 uV
        Frequency               : 1052631 kHz

Thermal
        Fan Speed Control       : N/A
        Fan RPM                 : N/A
        Fan PWM                 : N/A
        Die Temp                : 62 C

GDDR
        GDDR Vendor             : Elpida
        GDDR Version            : 0x1
        GDDR Density            : 2048 Mb
        GDDR Size               : 7936 MB
        GDDR Technology         : GDDR5
        GDDR Speed              : 5.000000 GT/s
        GDDR Frequency          : 2500000 kHz
        GDDR Voltage            : 1501000 uV

Users can log in directly onto the Xeon Phi coprocessor via ssh.

lu65fok@i01r13c01:~> ssh i01r13c01-mic0
lu65fok@i01r13c01-mic0:~$ hostname
i01r13c01-mic0
lu65fok@i01r13c01-mic0:~$ cat /etc/issue
Intel MIC Platform Software Stack (Built by Poky 7.0) 3.4 \n \l
lu65fok@i01r13c01-mic0:~$

By default the home directory on the coprocessor is /home/username.

Since access to the coprocessor is ssh-key based, users have to generate a private/public key pair via ssh-keygen before accessing the coprocessor for the first time.
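
For example, a key pair can be generated on the host with the standard OpenSSH tool (the RSA key type below is just the common default; adapt it to local policy):

weinberg@knf1:~> ssh-keygen -t rsa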

After the keys have been generated, the following commands have to be executed as root to populate the filesystem image for the coprocessor on the host (/opt/intel/mic/filesystem/mic0/home) with the new keys. Since the coprocessor has to be restarted to copy the new image to the coprocessor, the following commands have to be used with care (preferably only by the system administrator).

weinberg@knf1:~> sudo service mpss stop
weinberg@knf1:~> sudo micctrl --resetconfig
weinberg@knf1:~> sudo service mpss start

On production systems access to reserved cards might be realized by the job scheduler.

Information on how to set up and configure a cluster with hosts containing Intel Xeon Phi coprocessors, based on how Intel configured its own Endeavor cluster, can be found in [23].

Since a Linux kernel is running on the coprocessor, further information about the cores, memory etc. can be obtained from the virtual Linux /proc or /sys filesystems:


lu65fok@i01r13c01-mic0:~$ tail -n 25 /proc/cpuinfo
processor       : 239
vendor_id       : GenuineIntel
cpu family      : 11
model           : 1
model name      : 0b/01
stepping        : 3
cpu MHz         : 1052.630
cache size      : 512 KB
physical id     : 0
siblings        : 240
core id         : 59
cpu cores       : 60
apicid          : 239
initial apicid  : 239
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips        : 2113.10
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
lu65fok@i01r13c01-mic0:~$

lu65fok@i01r13c01-mic0:~$ head /proc/meminfo
MemTotal:        7882352 kB
MemFree:         7499036 kB
Buffers:               0 kB
Cached:           129420 kB
SwapCached:            0 kB
Active:            41868 kB
Inactive:          90880 kB
Active(anon):      41796 kB
Inactive(anon):    82116 kB
Active(file):         72 kB
lu65fok@i01r13c01-mic0:~$

To run MKL, OpenMP or MPI based programs on the coprocessor, some libraries (exact path and filenames may differ depending on the version) need to be copied to the coprocessor or linked from the install directories of the Intel compiler suite. On production systems the libraries might be installed or mounted on the coprocessor already. Root privileges are necessary for the destination directories given in the following example. E.g. on the SuperMIC system @ LRZ the following files are linked. (Here link.sh is a simple script that generates softlinks for all files of a given directory.)

ssh ${HOSTNAME}-${CARD} 'link.sh /lrz/sys/intel/impi/5.1.3.181/mic/lib/ /lib64'
ssh ${HOSTNAME}-${CARD} 'link.sh /lrz/sys/intel/impi/5.1.3.181/mic/bin/ /bin'
ssh ${HOSTNAME}-${CARD} 'link.sh /lrz/sys/intel/studio2016_u4/compilers_and_libraries_2016.4.258/linux/compiler/lib/mic/ /lib64'
ssh ${HOSTNAME}-${CARD} 'link.sh /lrz/sys/intel/studio2016_u4/compilers_and_libraries_2016.4.258/linux/mkl/lib/mic/ /lib64'
ssh ${HOSTNAME}-${CARD} 'link.sh /lrz/sys/intel/studio2016_u4/compilers_and_libraries_2016.4.258/linux/tbb/lib/mic/ /lib64'
ssh ${HOSTNAME}-${CARD} 'link.sh /lrz/sys/intel/studio2016_u4/compilers_and_libraries_2016.4.258/linux/ipp/lib/mic/ /lib64'

3. Programming Models

The main advantage of the MIC architecture is the possibility to program the coprocessor using plain C, C++ or Fortran and standard parallelisation models like OpenMP, MPI and hybrid OpenMP and MPI. The coprocessor can also be programmed using Intel Cilk Plus, Intel Threading Building Blocks, pthreads and OpenCL. Standard math libraries like Intel MKL are supported, as is the whole Intel tool chain, e.g. the Intel C/C++ and Fortran compilers, the debugger and Intel VTune Amplifier. It is also possible to do hardware-specific tuning using intrinsics or assembler. However, we would not recommend doing this (except maybe for some critical kernel routines), since MIC vector intrinsics and assembler instructions are incompatible with SSE or AVX instructions.

An overview of the available programming models is shown in the following picture.

Figure 4. MIC Programming Models

Generally speaking, two main execution modes can be distinguished: native mode and offload mode. In "native mode" the Intel compiler is instructed (through the use of the compiler switch -mmic) to cross-compile for the MIC architecture. This is also possible for OpenMP and MPI codes. The generated executable has to be copied to the coprocessor and can be launched from within a shell running on the coprocessor.


In “offload mode” the code is instrumented with OpenMP-like pragmas in C/C++ or comments in Fortran to mark regions of code that should be offloaded to the coprocessor and be executed there at runtime. The code in the marked regions can be multithreaded by using e.g. OpenMP. The generated executable must be launched from the host. This approach is quite similar to the accelerator pragmas introduced by the PGI compiler, CAPS HMPP or OpenACC to offload code to GPGPUs.

For MPI programs, MPI ranks can reside on only the coprocessor(s), on only the host(s) (possibly doing offloading), or on both the host(s) and the coprocessor(s), allowing various combinations in clusters.

The following picture demonstrates the spectrum of execution modes supported, ranging from Xeon-centric over symmetric to MIC-centric modes.

Figure 5. Spectrum of MIC Programming Models

4. Native compilation

The simplest model of running applications on the Intel Xeon Phi coprocessor is native mode. Detailed information about building a native application for Intel Xeon Phi coprocessors can be found in [29].

In native mode an application is compiled on the host using the compiler switch -mmic to generate code for the MIC architecture. The binary can then be copied to the coprocessor and has to be started there.

weinberg@knf1:~/c> . /opt/intel/composerxe/bin/compilervars.sh intel64
weinberg@knf1:~/c> icc -O3 -mmic program.c -o program
weinberg@knf1:~/c> scp program mic0:
program                100%   10KB  10.2KB/s   00:00
weinberg@knf1:~/c> ssh mic0 ~/program
hello, world

To achieve good performance one should mind the following items.

• Data should be aligned to 64 Bytes (512 Bits) for the MIC architecture, in contrast to 32 Bytes (256 Bits) for AVX and 16 Bytes (128 Bits) for SSE. See Section 13.1.1 for details.


• Due to the large SIMD width of 64 Bytes vectorization is even more important for the MIC architecture than for Intel Xeon! The MIC architecture offers new instructions like gather/scatter, fused multiply-add, masked vector instructions etc. which allow more loops to be parallelized on the coprocessor than on an Intel Xeon based host.

• Use pragmas like #pragma ivdep, #pragma vector always, #pragma vector aligned, #pragma simd etc. to achieve autovectorization. Autovectorization is enabled at default optimization level -O2. Requirements for vectorizable loops can be found in [38].

• Let the compiler generate vectorization reports using the compiler option -qopt-report=n -qopt-re- port-phase=vec to see if loops were vectorized for MIC (Message "*MIC* Loop was vectorized" etc) in the output files.

• Explicit vector programming is also possible via Intel Cilk Plus language extensions (C/C++ array notation, vector elemental functions, ...) or the new SIMD constructs from OpenMP 4.0 RC1.

• Vector elemental functions can be declared by using __attribute__((vector)). The compiler then generates a vectorized version of a scalar function which can be called from a vectorized loop.

• One can use intrinsics to have full control over the vector registers and the instruction set. Include immintrin.h for using intrinsics.

• Hardware prefetching from the L2 cache is enabled per default. In addition, software prefetching is on by default at compiler optimization level -O2 and above. Since Intel Xeon Phi is an in-order architecture, care about prefetching is more important than on out-of-order architectures. The compiler prefetching can be influenced by setting the compiler switch -opt-prefetch=n. Manual prefetching can be done by using intrinsics (_mm_prefetch()) or pragmas (#pragma prefetch var).

5. Offloading

5.1. Introduction

Note: The offloading approach described in this section is the Intel Language Extensions for Offload (LEO), which was the only method available for offloading using pragmas when the first Intel Xeon Phi coprocessors were shipped. Meanwhile OpenMP 4.x offloading is also supported by the Intel Compiler. This is covered in Section 6.3.

The offload programming model treats the Xeon Phi just like a co-processor: part of the executable code is executed on the MIC co-processor. The program is run on the host processor and is compiled for the Sandy Bridge/Haswell architecture; the only difference is that some parts of the executable are executed on the co-processor. The part of the code run on the co-processor can be a library function like MKL, a user-written function or routine, or a region of the program. In cases involving user-written code the compiler must be instructed to generate code to be cross compiled for the MIC architecture. This is done with compiler directives, much like OpenMP directives.

One can simply add OpenMP-like pragmas to C/C++ or Fortran code to mark regions of code that should be offloaded to the Intel Xeon Phi coprocessor and be run there. This approach is quite similar to the accelerator pragmas introduced by the PGI compiler, CAPS HMPP or OpenACC to offload code to GPGPUs. When the Intel compiler encounters an offload pragma, it generates code for both the coprocessor and the host. Code to transfer the data to the coprocessor is automatically created by the compiler, however the programmer can influence the data transfer by adding data clauses to the offload pragma. Details can be found in the Intel compiler documentation.

5.2. Simple C example

In the following we show a simple example how to offload a matrix-matrix computation to the coprocessor.


main(){

  double *a, *b, *c;
  int i,j,k, ok, n=100;

  // allocated memory on the heap aligned to 64 byte boundary
  ok = posix_memalign((void**)&a, 64, n*n*sizeof(double));
  ok = posix_memalign((void**)&b, 64, n*n*sizeof(double));
  ok = posix_memalign((void**)&c, 64, n*n*sizeof(double));

  // initialize matrices
  ...

  //offload code
  #pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
  {
    //parallelize via OpenMP on MIC
    #pragma omp parallel for
    for( i = 0; i < n; i++ ) {
      for( k = 0; k < n; k++ ) {
        #pragma vector aligned
        #pragma ivdep
        for( j = 0; j < n; j++ ) {
          //c[i][j] = c[i][j] + a[i][k]*b[k][j];
          c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
        }
      }
    }
  }
}

This simple example (without any blocking etc. and thus quite bad performance) shows how to offload the matrix computation to the coprocessor using the #pragma offload target(mic). One could also specify the specific coprocessor num in a system with multiple coprocessors by using #pragma offload target(mic:num).

Since the matrices have been dynamically allocated using posix_memalign(), their sizes must be specified via the length() clause. Using in, out and inout one can specify which data has to be copied in which direction. It is recommended that for Intel Xeon Phi data is 64-byte aligned. #pragma vector aligned tells the compiler that all array data accessed in the loop is properly aligned. #pragma ivdep discards any data dependencies assumed by the compiler.

Offloading is enabled per default for the Intel compiler. Use -no-offload to disable the generation of offload code.

5.3. Simple Fortran-90 example

In the following we show a simple example how to offload a Fortran 90 matrix-matrix computation to the coprocessor.

5.3.1. User function offload

The Intel compilers have support for user defined functions or regions inside a program to be offloaded onto the MIC co-processor. This is a bit more complex than just calling MKL routines. One can use regions in a program or write a complete function or subroutine to be compiled and run on the co-processor. In both cases the code marked for offload will be cross compiled for the MIC architecture. The run time system will launch the code on the MIC processor and data is exchanged with yet another set of run time functions. Many combinations are possible: overlap of data transfer, load balancing between the host processor and the co-processor, etc. However, all of this is left to the programmer. This makes the co-processor a bit harder to use in production, where no or only very few changes to the source code are wanted.

5.3.2. Offloading regions in program

This is the simplest solution, where directives that instruct the compiler to generate offload code are just inserted into the program code. The following code shows an example of how the nested do loop is effectively offloaded from the host processor to the MIC co-processor.

!dir$ offload begin target(mic) in(a,b) out(c)
!$omp parallel do private(j,l,i)
  do j=1,n
     do l=1,n
        do i=1,n
           c(j,l)=c(j,l) + a(i,l)*b(i,l)
        enddo
     enddo
  enddo
  nt=omp_get_max_threads()

#ifdef __MIC__
  print*, "Hello MIC threads:",nt
#else
  print*, "Hello CPU threads:",nt
#endif
!dir$ end offload

The offloaded part of the code is executed on the MIC co-processor using an environment either set up by the system or by the user via environment variables. Both the number of threads and the thread placement on the MIC processor can be controlled in this way. During the time the offload code is run on the co-processor the host processors are idle. To achieve load balancing the user must explicitly program the work partition.

This approach makes usage of the offload code simpler and is the more common way of programming. The user writes a complete function or subroutine to be offloaded. Using this routine is straightforward: it is called just like any other function, with the only addition that the data transfer between the two memories must be handled. This data transfer can overlap with other work on the host processor, hiding the latency of the transfer. To follow the example above, this piece of code can easily be put into a subroutine.

!dir$ attributes offload : mic :: mxm, omp_get_max_threads
subroutine mxm(a,b,c,n)
  use constants

  integer :: n
  real(r8),dimension(n,n) :: a,b,c
  integer :: i,j,l,nt

!$omp parallel do private(j,l,i)
  do j=1,n
     do l=1,n
        do i=1,n
           c(j,l)=c(j,l) + a(i,l)*b(i,l)
        enddo
     enddo
  enddo
  nt=omp_get_max_threads()

#ifdef __MIC__


print*, "Hello MIC threads:",nt #else print*, "Hello CPU threads:",nt #endif end subroutine mxm

This routine will now be compiled to an object file suitable for execution on the MIC co-processor. It can be called like any other routine, but the data transfer must be accommodated for: the calling program needs to arrange the transfer.

time_start=mysecond()
!dir$ offload_transfer target(mic:0) in( a: alloc_if(.true.) free_if(.false.) )
!dir$ offload_transfer target(mic:0) in( b: alloc_if(.true.) free_if(.false.) )
!dir$ offload_transfer target(mic:0) in( c: alloc_if(.true.) free_if(.false.) )

!dir$ offload target(mic:0) in(a,b: alloc_if(.false.) free_if(.false.)) &
              out(c: alloc_if(.false.) free_if(.false.))
  call mxm(a,b,c,n)

time_end=mysecond()

The subroutine call will block when using this construct. In order to utilize both host and co-processor resources, concurrency and synchronization need to be introduced. However, the above setup works very well for testing and timing purposes.

5.3.4. Load distribution and balancing, host cpus, co-processors (single and multiple)

It is relatively straightforward to set up concurrent runs, workload distribution and ultimately load balancing between the host cpus and the Xeon Phi MIC processors. However, all of the administration is left to the programmer. Since there is no shared memory, the work and memory partition must be handled explicitly. Only the buffers used on the co-processors need to be transferred, as memory movement is limited by the PCIe bus' bandwidth. There are mechanisms for offload transfers and semaphores for synchronising both transfer and execution. All of this must be explicitly handled by the programmer. While each part is relatively simple, it can become quite complex when trying to partition the problem and load balance at the same time. The examples below try to illustrate this.

Filling the matrices for the co-processors:

am0(:,:)=a(1:m,:)
am1(:,:)=a(m+1:2*m,:)
bm0(:,:)=b(1:m,:)
bm1(:,:)=b(m+1:2*m,:)

Initiate data transfer:

!dir$ offload_transfer target(mic:0) in( am0: alloc_if(.true.) free_if(.false.) )
!dir$ offload_transfer target(mic:0) in( bm0: alloc_if(.true.) free_if(.false.) )

!dir$ offload_transfer target(mic:1) in( am1: alloc_if(.true.) free_if(.false.) )
!dir$ offload_transfer target(mic:1) in( bm1: alloc_if(.true.) free_if(.false.) )

Variables for each co-processor have been declared and allocated. These are 1/3 of the size of the total matrix held in host memory. Each compute element (the host Sandy Bridge processors, mic0 and mic1) does 1/3 of the total calculation. There is no dynamic load balancing; the split is fixed at 1/3 each. Calling the offloading subroutine:

time_start=mysecond()

!dir$ offload target (mic:0) in(am0,bm0) out(cm0) signal(s1)
  call mxm(am0,bm0,cm0,m,n)

print *,"cm0",cm0(:,:)

!dir$ offload target (mic:1) in(am1,bm1) out(cm1) signal(s2)
  call mxm(am1,bm1,cm1,m,n)

print *,"cm1",cm1(:,:)

kc=2*m+1

!$omp parallel do private(j,l,i)
  do j=kc,n
     do l=1,n
        do i=1,n
           c(j,l)=c(j,l) + a(i,l)*b(i,l)
        enddo
     enddo
  enddo
  nt=omp_get_max_threads()
#ifdef __MIC__
  print*, "Hello MIC threads:",nt
#else
  print*, "Hello CPU threads:",nt
#endif

Here we wait for the co-processors if they have not yet finished, with one semaphore for each co-processor.

!dir$ offload_wait target(mic:0) wait(s1)
!dir$ offload_wait target(mic:1) wait(s2)

Copy the data received from the co-processors back into the matrices in host memory:

! Put the parts computed on the mics into the sub matrix of c.

c(1:m,:)=cm0(:,:) c(m+1:2*m,:)=cm1(:,:)

! c(2*m+1:3*m,:) is already in c, nothing to do.

time_end=mysecond()

The amount of work is about evenly distributed, with just a little time spent waiting for another compute element to finish its work. In a production run the load balancing would be set up dynamically and hence a better load balance obtained. However, this is left to the programmer and requires detailed knowledge of the workload. Again we see that the actual programming is easy, but the administration of the workload can be complex. One advantage is that we have more memory available and can attack bigger problems without having to run offload parts in series with a chunk of data for each run. This must be done if the problem we want to offload exceeds the 8 GiB of memory currently installed on the Xeon Phi cards.

Intel provides the tools needed to do the job, but the programmer needs to handle all the details himself. It must be noted that this is vastly simpler than doing similar programming using a GPU: here the exact same Fortran 90 code is used for both processor classes, host processor and Xeon Phi/MIC. Offloading is a very easy way to start

accelerating your code. It might not utilize the co-processor's full potential, but nevertheless any speedup obtained with a small effort is worth harvesting.

5.4. Obtaining information about the offloading

Using the compiler option -vec-report2 one can see which loops have been vectorized on the host and the MIC coprocessor:

weinberg@knf1:~/c> icc -vec-report2 -openmp offload.c
offload.c(57): (col. 2) remark: loop was not vectorized: vectorization possible but seems inefficient.
...
offload.c(57): (col. 2) remark: *MIC* LOOP WAS VECTORIZED.
offload.c(54): (col. 7) remark: *MIC* loop was not vectorized: not inner loop.
offload.c(53): (col. 5) remark: *MIC* loop was not vectorized: not inner loop.

The option -vec-report2 has been deprecated in icc 15.0. Use -qopt-report=n -qopt-report-phase=vec and view the *.optrpt files.
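
For example, the vectorization report for the offload example could be requested as follows (the report level and the file name offload.c are only illustrative; the remarks end up in offload.optrpt):

weinberg@knf1:~/c> icc -qopt-report=2 -qopt-report-phase=vec offload.c
weinberg@knf1:~/c> grep -i vectorized offload.optrpt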

By setting the environment variable OFFLOAD_REPORT one can obtain information about performance and data transfers at runtime:

weinberg@knf1:~/c> export OFFLOAD_REPORT=2
weinberg@knf1:~/c> ./a.out
[Offload] [MIC 0] [File]            offload2.c
[Offload] [MIC 0] [Line]            50
[Offload] [MIC 0] [CPU Time]        12.853562 (seconds)
[Offload] [MIC 0] [CPU->MIC Data]   9830416 (bytes)
[Offload] [MIC 0] [MIC Time]        12.208636 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   3276816 (bytes)

If a function is called within the offloaded code block, this function has to be declared with __attribute__((target(mic))) .

For example one could put the matrix-matrix multiplication of the previous example into a subroutine and call that routine within an offloaded block region:

__attribute__((target(mic))) void mxm( int n, double * restrict a,
       double * restrict b, double *restrict c ){
  int i,j,k;
  for( i = 0; i < n; i++ ) {
  ...
}

main(){

  ...

  #pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
  {
    mxm(n,a,b,c);
  }

}


Mind the C99 restrict keyword that specifies that the vectors do not overlap. (Compile with -std=c99)

5.5. Syntax of pragmas

The following offload pragmas are available (from [14]):

C/C++:

  Offload pragma
    Syntax:   #pragma offload
    Semantic: Allow next statement to execute on coprocessor or host CPU

  Variable/function offload properties
    Syntax:   __attribute__((target(mic)))
    Semantic: Compile function for, or allocate variable on, both host CPU and coprocessor

  Entire blocks of data/code defs
    Syntax:   #pragma offload_attribute(push, target(mic))
              ...
              #pragma offload_attribute(pop)
    Semantic: Mark entire files or large blocks of code to compile for both host CPU and coprocessor

Fortran:

  Offload directive
    Syntax:   !dir$ omp offload <directive>
    Semantic: Execute next OpenMP parallel block on coprocessor

  Variable/function offload properties
    Syntax:   !dir$ attributes offload:<mic> :: <routine-name> OR <var1, var2, ...>
    Semantic: Compile function or variable for CPU and coprocessor

  Entire code blocks
    Syntax:   !dir$ offload begin
              ...
              !dir$ end offload
    Semantic: Mark entire files or large blocks of code to compile for both host CPU and coprocessor

The following clauses can be used to control data transfers:

Clause                 Syntax                       Semantic
Multiple coprocessors  target(mic[:unit])           Select specific coprocessors
Inputs                 in(var-list modifiers)       Copy from host to coprocessor
Outputs                out(var-list modifiers)      Copy from coprocessor to host
Inputs & outputs       inout(var-list modifiers)    Copy host to coprocessor and back when offload completes
Non-copied data        nocopy(var-list modifiers)   Data is local to target

The following (optional) modifiers are specified:

Modifier                                         Syntax                                     Semantic
Specify copy length                              length(N)                                  Copy N elements of pointer's type
Coprocessor memory allocation                    alloc_if ( bool )                          Allocate coprocessor space on this offload (default: TRUE)
Coprocessor memory release                       free_if ( bool )                           Free coprocessor space at the end of this offload (default: TRUE)
Control target data alignment                    align ( N bytes )                          Specify minimum memory alignment on coprocessor
Array partial allocation & variable relocation   alloc ( array-slice ) into ( var-expr )    Enables partial array allocation and data copy into other vars & ranges

5.6. Recommendations

5.6.1. Explicit worksharing

To explicitly share work between the coprocessor and the host one can use OpenMP sections to manually distribute the work. In the following example both the host and the coprocessor will run a matrix-matrix multiplication in parallel.

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    {
      //section running on the coprocessor
      #pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
      {
        mxm(n,a,b,c);
      }
    }
    #pragma omp section
    {
      //section running on the host
      mxm(n,d,e,f);
    }
  }
}

5.6.2. Persistent data on the coprocessor

The main bottleneck of accelerator based programming is data transfer over the slow PCIe bus from the host to the accelerator and vice versa. To increase the performance one should minimize data transfers as much as possible and keep the data on the coprocessor between computations using the same data.

Defining the following macros

#define ALLOC alloc_if(1)
#define FREE free_if(1)
#define RETAIN free_if(0)
#define REUSE alloc_if(0)

one can simply use the following notation:

• to allocate data and keep it for the next offload


#pragma offload target(mic) in (p:length(l) ALLOC RETAIN)

• to reuse the data and still keep it on the coprocessor

#pragma offload target(mic) in (p:length(l) REUSE RETAIN)

• to reuse the data again and free the memory. (FREE is the default, and does not need to be explicitly specified)

#pragma offload target(mic) in (p:length(l) REUSE FREE)

More information can be found in the section "Managing Memory Allocation for Pointer Variables" under "Offload Using a Pragma" in the Intel compiler documentation.

5.6.3. Optimizing offloaded code

The implementation of the matrix-matrix multiplication given in Section 5.2 can be optimized by defining appropriate ROWCHUNK and COLCHUNK chunk sizes, rewriting the code with 6 nested loops (using OpenMP collapse for the 2 outermost loops) and some manual loop unrolling (thanks to A. Heinecke for input for this section).

#define ROWCHUNK 96
#define COLCHUNK 96

#pragma omp parallel for collapse(2) private(i,j,k,ii,kk,jj)
for(i = 0; i < n; i+=ROWCHUNK ) {
  for(j = 0; j < n; j+=ROWCHUNK ) {
    for(k = 0; k < n; k+=COLCHUNK ) {
      for (ii = i; ii < i+ROWCHUNK; ii+=6) {
        for (kk = k; kk < k+COLCHUNK; kk++ ) {
          #pragma ivdep
          #pragma vector aligned
          for ( jj = j; jj < j+ROWCHUNK; jj++){
            c[(ii*n)+jj]     += a[(ii*n)+kk]*b[kk*n+jj];
            c[((ii+1)*n)+jj] += a[((ii+1)*n)+kk]*b[kk*n+jj];
            c[((ii+2)*n)+jj] += a[((ii+2)*n)+kk]*b[kk*n+jj];
            c[((ii+3)*n)+jj] += a[((ii+3)*n)+kk]*b[kk*n+jj];
            c[((ii+4)*n)+jj] += a[((ii+4)*n)+kk]*b[kk*n+jj];
            c[((ii+5)*n)+jj] += a[((ii+5)*n)+kk]*b[kk*n+jj];
          }
        }
      }
    }
  }
}

Using intrinsics with manual data prefetching and register blocking can still considerably increase the performance. Generally speaking, the programmer should try to get a suitable vectorization and write cache and register efficient code, i.e. values stored in registers should be reused as often as possible in order to avoid cache and memory access. The tuning techniques for native implementations discussed in Section 4 also apply for offloaded code, of course.

5.7. Intel Cilk Plus parallel extensions

More complex data structures can be handled by Virtual Shared Memory. In this case the same virtual address space is used on both the host and the coprocessor, enabling a seamless sharing of data. Virtual shared data is specified using the _Cilk_shared allocation specifier. This model is integrated in Intel Cilk Plus parallel

extensions and is only available in C/C++. There are also Cilk functions to specify offloading of functions and _Cilk_for loops. More information on Intel Cilk Plus can be found online under [15].

Intel Cilk Plus offers elemental functions, array syntax and "#pragma simd" to help with vectorization. To achieve the best performance, the use of array syntax should be combined with cache blocking. Unfortunately, when using constructs such as A[:] = B[:] + C[:]; for large arrays the performance may degrade. To make proper use of the array syntax, ensure the vector length of single statements is short (some small multiple of the native vector length, perhaps only 1X). Finally, and perhaps most important to programmers today, Intel Cilk Plus offers a mandatory vectorization pragma for the compiler, "#pragma simd". The intent of "#pragma simd" is to do for vectorization what OpenMP has done for parallelization. Intel Cilk Plus requires compiler support. It is currently available from Intel for Windows, Linux and Apple OS X. It is also available in a branch of gcc.
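
A small sketch of these three features is given below (the function and array names are made up for illustration; the elemental function is declared with __attribute__((vector)) as mentioned in Section 4, and restrict requires -std=c99):

// Elemental function: the compiler additionally generates a vectorized
// version that can be called from vectorized loops and array-notation code.
__attribute__((vector)) double scale_add(double a, double b)
{
    return 2.0 * a + b;
}

void demo(int n, double * restrict x, double * restrict y, double * restrict z)
{
    // Array notation on explicit array sections of length n.
    z[0:n] = x[0:n] + y[0:n];

    // Elemental function applied element-wise via array notation.
    z[0:n] = scale_add(x[0:n], y[0:n]);

    // Mandatory vectorization of a loop with #pragma simd.
    #pragma simd
    for (int i = 0; i < n; i++)
        z[i] = scale_add(x[i], y[i]);
}

In line with the cache-blocking advice above, n should be chosen small enough that the sections stay cache resident.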

Intel Cilk Plus provides three new keywords, support for reduction operations, and data parallel extensions. The keyword cilk_spawn can be applied to a function call, as in x = cilk_spawn fib(n-1), to indicate that the function fib can execute concurrently with the subsequent code. The keyword cilk_sync indicates that execution has to wait until all spawns from the function have returned. The use of the function as a unit of spawn makes the code readable, relies on the baseline C/C++ language to define scoping rules of variables, and allows Intel Cilk Plus programs to be composable. Below is an example calculating the Fibonacci series using Cilk Plus.

int fib (int n) {
  if (n < 2)
    return 1;
  else {
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x + y;
  }
}

Tools for Cilk Plus on the Xeon Phi

• Compiling: Intel C compiler - icc cilkprogram.cpp -o cilkprogram

• Check parallelism: cilkview ./cilkprogram

• Check race conditions: cilkscreen ./cilkprogram

• Profiling: cilkprof ./cilkprogram

• GCC 5.0 and LLVM also provide Cilk Plus support

6. OpenMP, hybrid and OpenMP 4.x Offloading

6.1. OpenMP

6.1.1. OpenMP basics

OpenMP is the de facto standard pragma-based parallel programming model for shared memory systems. The standard can be found online under http://www.openmp.org/specifications/ [46]. OpenMP parallelization on an Intel Xeon + Xeon Phi coprocessor machine can be applied both on the host and on the device with the -openmp compiler option. It can run on the Xeon host, natively on the Xeon Phi coprocessor and also in different offload schemes. It is introduced with regular pragma statements in the code for either case.
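
As a minimal illustration, the same OpenMP source could be built for the different modes roughly as follows (the file name is made up; newer compilers spell the option -qopenmp):

# OpenMP on the host (offload regions, if present, are honoured by default)
icc -openmp prog.c -o prog.host

# OpenMP running natively on the coprocessor
icc -openmp -mmic prog.c -o prog.mic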


In applications that run both on the host and on the Xeon Phi coprocessor, OpenMP threads do not interfere with each other, and offload statements are executed on the device based on the available resources. If there are no free threads on the Xeon Phi coprocessor, or fewer than requested, the parallel region will be executed on the host. Offload pragma statements and OpenMP pragma statements are independent from each other and must both be present in the code. Within the latter construct the usual semantics of shared and private data apply.
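
The following sketch, using the LEO offload pragma from Section 5, illustrates that the host and the coprocessor run separate OpenMP runtimes; the two parallel regions report their thread counts independently (and the offload region falls back to the host if no coprocessor is available):

#include <stdio.h>
#include <omp.h>

int main(void) {
    // OpenMP region executed by the host runtime
    #pragma omp parallel
    {
        #pragma omp master
        printf("host threads: %d\n", omp_get_num_threads());
    }

    // OpenMP region executed by the coprocessor runtime
    #pragma offload target(mic)
    {
        #pragma omp parallel
        {
            #pragma omp master
            printf("offload threads: %d\n", omp_get_num_threads());
        }
    }
    return 0;
}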

There are 16 threads available on every Xeon host CPU and 4 threads per core, i.e. 4 times the number of cores, on every Xeon Phi coprocessor. For offload schemes the maximal number of threads that can be used on the device is 4 times the number of cores minus one, because one core is reserved for the OS and its services. Offload to the Xeon Phi coprocessor can be done at any time by multiple host CPUs until the available resources are filled. If there are no free threads, the task meant to be offloaded may be done on the host.

6.1.2. Threading and affinity

The Xeon Phi coprocessor supports 4 threads per core. Unlike some CPU-intensive HPC applications that are run on Xeon architecture, which do not benefit from hyperthreading, applications run on Xeon Phi coprocessors do and using more than one thread per core is recommended.

The most important considerations for OpenMP threading and affinity are the total number of threads that should be utilized and the scheme for binding threads to processor cores. These are controlled with the environment variables OMP_NUM_THREADS and KMP_AFFINITY.

The default settings are as follows:

                                     OMP_NUM_THREADS
OpenMP on host without HT            1 x ncore-host
OpenMP on host with HT               2 x ncore-host
OpenMP on Xeon Phi in native mode    4 x ncore-phi
OpenMP on Xeon Phi in offload mode   4 x (ncore-phi - 1)

For host-only and native-only execution the number of threads and all other environment variables are set as usual:

% export OMP_NUM_THREADS=16
% export OMP_NUM_THREADS=240
% export KMP_AFFINITY=compact/scatter/balanced

Setting affinity to “compact” will place OpenMP threads by filling cores one by one, while “scatter” will place one thread in every core until reaching the total number of cores and then continue with the first core. When using “balanced”, it is similar to “scatter”, but threads with adjacent numbers will be placed on the same core. “Balanced” is only available for the Xeon Phi coprocessor. Another useful option is the verbosity modifier:

% export KMP_AFFINITY=verbose,balanced

With compiler version 13.1.0 and newer one can use the KMP_PLACE_THREADS variable to point out the topology of the system to the OpenMP runtime, for example:

% export OMP_NUM_THREADS=180
% export KMP_PLACE_THREADS=60c,3t

meaning that 60 cores and 3 threads per core should be used. Still one should use the KMP_AFFINITY variable to bind the threads to the cores.

If OpenMP regions exist on the host and on the part of the code offloaded to the Phi, two separate OpenMP runtimes exist. Environment variables for controlling OpenMP behavior are to be set for both runtimes, for example the KMP_AFFINITY variable which can be used to assign a particular thread to a particular physical node. For Phi it can be done like this:


$ export MIC_ENV_PREFIX=MIC

# specify affinity for all cards
$ export MIC_KMP_AFFINITY=...

# specify number of threads for all cards
$ export MIC_OMP_NUM_THREADS=120

# specify the number of threads for card #2
$ export MIC_2_OMP_NUM_THREADS=200

# specify the number of threads and affinity for card #3
$ export MIC_3_ENV="OMP_NUM_THREADS=60 | KMP_AFFINITY=balanced"

If MIC_ENV_PREFIX is not set and KMP_AFFINITY is set to “balanced” it will be ignored by the runtime.

One can also use special API calls to set the environment for the coprocessor only, e.g.

omp_set_num_threads_target()
omp_set_nested_target()

6.1.3. Loop scheduling

OpenMP accepts four different kinds of loop scheduling: static, dynamic, guided and auto. In this way the amount of iterations done by different threads can be controlled. The schedule clause can be used to set the loop scheduling at compile time. Another way to control this feature is to specify schedule(runtime) in your code and select the loop scheduling at runtime through setting the OMP_SCHEDULE environment variable.

6.1.4. Scalability improvement

If the amount of work that should be done by each thread is non-trivial and consists of nested for-loops, one might use the collapse() clause to specify how many for-loops are associated with the OpenMP loop construct. This often improves scalability of OpenMP applications (see Section 5.6.3).
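
A minimal sketch (loop bounds and array names are made up for the example):

// Collapsing the two outer loops creates one larger iteration space,
// so all threads get work even if the outer trip count alone is small.
#pragma omp parallel for collapse(2) schedule(runtime)
for (int i = 0; i < ni; i++)
    for (int j = 0; j < nj; j++)
        a[i][j] = 0.5 * (b[i][j] + c[i][j]);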

Another way to improve scalability is to reduce barrier synchronization overheads by using the nowait clause. The effect of it is that the threads will not synchronize after they have completed their individual pieces of work. This approach is applicable combined with static loop scheduling, because all threads will execute the same amount of iterations in each loop, but it is also a potential threat to the correctness of the code.

6.2. Hybrid OpenMP/MPI

6.2.1. Programming models

For hybrid OpenMP/MPI programming there are two major approaches: an MPI offload approach, where MPI ranks reside on the host CPU and work is offloaded to the Xeon Phi coprocessor and a symmetric approach in which MPI ranks reside both on the CPU and on the Xeon Phi. An MPI program can be structured using either model.
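
Regardless of which of the two models is chosen, the per-rank code looks like ordinary hybrid MPI/OpenMP code. A minimal sketch (rank and thread counts are chosen at launch time, not in the code):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    // Request a threading level that allows OpenMP regions inside each rank.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}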

When assigning MPI ranks, one should take into account that there is a data transfer overhead over the PCIe bus, so minimizing the communication from and to the Xeon Phi is a good idea. Another consideration is that there is a limited amount of memory on the coprocessor, which favors the shared memory parallelism approach, so placing 120 MPI ranks on the coprocessor, each of which starts 2 threads, might be less effective than placing fewer ranks but allowing them to start more threads. Note that MPI directives cannot be called within a pragma offload statement.

6.2.2. Threading of the MPI ranks

For hybrid OpenMP/MPI applications use the thread safe version of the Intel MPI Library by using the -mt_mpi compiler driver option. A desired process pinning scheme can be set with the I_MPI_PIN_DOMAIN environment variable. It is recommended to use the following setting:

$ export I_MPI_PIN_DOMAIN=omp


By setting this to omp, the process pinning domain size is set equal to OMP_NUM_THREADS. In this way, every MPI process is able to create OMP_NUM_THREADS threads that run within the corresponding domain. If this variable is not set, each MPI process will create a number of threads equal to the number of cores, because the whole node is then treated as a single domain.

Further, to pin the OpenMP threads within a particular domain, one can use the KMP_AFFINITY environment variable, as illustrated in the sketch below.
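
A minimal hybrid sketch to check the resulting placement; it prints for every rank and thread the logical CPU it runs on (sched_getcpu() is Linux-specific):

#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %d, thread %d of %d, on logical CPU %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());

    MPI_Finalize();
    return 0;
}

Built with mpiicc -mt_mpi -openmp and run under I_MPI_PIN_DOMAIN=omp, each rank's threads should report CPUs from its own domain only.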

6.3. OpenMP 4.x Offloading

6.3.1. Execution Model

Starting with OpenMP 4.0, new device constructs have been added to the OpenMP standard to support heterogeneous systems and to enable offloading of computations and data to devices.

The Intel compiler supports OpenMP 4.x offloading for the Intel Xeon Phi as an alternative to the Intel-specific Language Extensions for Offload (LEO) described in Section 5.

The OpenMP language terminology concerning devices is quite general. A device is defined as an implementation defined logical execution unit (which could have one or more processors). The execution model is host-centric: The host device, i.e. the device on which the OpenMP program begins execution, offloads code or data to a target device. If an implementation supports target devices, one or more devices are available to the host device for offloading. Threads running on one device are distinct from threads that execute on another device and cannot migrate to other devices.

The most important OpenMP device constructs are the target and the teams construct. When a target construct is encountered, a new target task is generated. When the target task is executed, the enclosed target region is executed by an initial thread which executes sequentially on a target device if present and supported. If the target device does not exist or the implementation does not support the target device, all target regions associated with that device execute on the host device. In this case, the implementation must ensure that the target region executes as if it were executed in the data environment of the target device.

The teams construct creates a league of thread teams where the master thread of each team executes the region sequentially.

This is the main difference in the execution model compared to versions < 4.x of the OpenMP standard. Before device directives have been introduced, a master thread could spawn a team of threads executing parallel regions of a program as needed (see Figure 6).

Figure 6. Execution model before OpenMP 4.x. From [48].


With OpenMP 4.x the master thread spawns a team of threads, and these threads can spawn leagues of thread teams as needed. Figure 7 shows an example of a teams region consisting of one team executing a parallel region, while Figure 8 shows an example using multiple teams.

Figure 7. Execution model after OpenMP 4.x executing one team. From [48].

Figure 8. Execution model after OpenMP 4.x executing multiple teams. From [48].

6.3.2. Overview of the most important device constructs

The following table gives an overview of the OpenMP 4.x device constructs:

#pragma omp target data          Creates a data environment for the extent of the region.
#pragma omp target enter data    Specifies that variables are mapped to a device data environment.
#pragma omp target exit data     Specifies that list items are unmapped from a device data environment.
#pragma omp target               Maps variables to a device data environment and executes the construct on the device.


#pragma omp target update            Makes the corresponding list items in the device data environment consistent with their original list items, according to the specified motion clauses (to/from).
#pragma omp declare target           A declarative directive that specifies that variables and functions are mapped to a device.
#pragma omp teams                    Creates a league of thread teams where the master thread of each team executes the region.
#pragma omp distribute               Specifies loops which are executed by the thread teams.
#pragma omp ... simd                 Specifies code that is executed concurrently using SIMD instructions.
#pragma omp distribute parallel for  Specifies a loop that can be executed in parallel by multiple threads that are members of multiple teams.

The most important constructs are the target, teams and distribute constructs, which are explained in more detail in the following.

6.3.3. The target construct

The target construct maps variables to a device data environment and executes the construct on that device.

#pragma omp target [clause[ [,] clause] ... ] new-line
    structured-block

where clause is one of the following:

• if([ target :] scalar-expression)

• device(integer-expression)

• private(list)

• firstprivate(list)

• map([[map-type-modifier[,]] map-type: ] list)

• is_device_ptr(list)

• defaultmap(tofrom:scalar)

• nowait

• depend(dependence-type: list)
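
As a minimal illustration of some of the clauses listed above (array names, size and the use_device flag are made up for this sketch):

#define N 4096

double x[N], y[N];

void scale_on_device(double factor, int use_device)
{
    /* Offload only if use_device evaluates to true; otherwise the region
       runs on the host. Data movement is expressed with map clauses. */
    #pragma omp target device(0) if(use_device) \
            map(to: x[0:N]) map(from: y[0:N]) firstprivate(factor)
    {
        for (int i = 0; i < N; i++)
            y[i] = factor * x[i];
    }
}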

6.3.4. The teams construct

The teams construct creates a league of thread teams and the master thread of each team executes the region.

#pragma omp teams [clause[ [,] clause] ... ] new-line
    structured-block

where clause is one of the following:

• num_teams(integer-expression)

• thread_limit(integer-expression)


• default(shared | none)

• private(list)

• firstprivate(list)

• shared(list)

• reduction(reduction-identifier : list)

6.3.5. The distribute construct

The distribute construct specifies that the iterations of one or more loops will be executed by the thread teams in the context of their implicit tasks. The iterations are distributed across the master threads of all teams that execute the teams region to which the distribute region binds.

#pragma omp distribute [clause[ [,] clause] ... ] new-line
    for-loops

Where clause is one of the following:

• private(list)

• firstprivate(list)

• lastprivate(list)

• collapse(n)

• dist_schedule(kind[, chunk_size])

A detailed description of the most important clauses can be found in the OpenMP standard.

6.3.6. Composite constructs and shortcuts in OpenMP 4.5

The OpenMP standard defines a couple of combined constructs. Combined constructs are shortcuts for specifying one construct immediately nested inside another construct. The semantics of the combined constructs are identical to that of explicitly specifying the first construct containing one instance of the second construct and no other statements. Figure 9 gives an overview of the supported composite constructs and shortcuts.

Figure 9. Overview of composite constructs and shortcuts in OpenMP 4.5.


The last construct (omp target teams distribute parallel for simd) contains the most levels of parallelism: the computation is offloaded to a target device and the iterations of a loop are executed in parallel by multiple threads that are members of multiple teams, using SIMD instructions.

6.3.7. Examples

Many useful examples can be found in the examples book of the OpenMP standard [47].

We are showing some selected examples in the following.

The following example (from [47]) shows how to copy input data (v1 and v2) from the host to the device, execute the computation of a parallel for loop on the device and copy the output data (p) back to the host.

#pragma omp target map(to: v1, v2) map(from: p)
#pragma omp parallel for
for (i=0; i<N; i++)
    p[i] = v1[i] * v2[i];

The next example (from [48]), shown in Figure 10, demonstrates how to create a device data environment, allocate memory in this environment, copy input data (input) from the host data environment to the device data environment before the execution of any code, and copy output data (res) back from the device data environment to the host data environment. Both for loops are executed on the device, while the routine update_input_array_on_the_host() is executed on the host and updates the array input.

The #pragma omp target update is used to synchronise the value of a variable (input) in the host data environment with a corresponding variable in a device data environment.

Figure 10. Example for a device data environment. From [48].
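
Since the code of Figure 10 is not reproduced here, the following sketch illustrates the described pattern; the array length N and the loop bodies are placeholders, and update_input_array_on_the_host() is the host routine named above:

#define N 100000

void update_input_array_on_the_host(double *input);   /* host-side routine */

void compute(double *input, double *res)
{
    /* Create a device data environment for the extent of the region. */
    #pragma omp target data map(to: input[0:N]) map(from: res[0:N])
    {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            res[i] = input[i] * 2.0;              /* first loop on the device */

        update_input_array_on_the_host(input);    /* runs on the host */

        /* Make the device copy of input consistent with the host copy. */
        #pragma omp target update to(input[0:N])

        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            res[i] += input[i];                   /* second loop on the device */
    }
}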

The next example (from [47]) shows how the target teams and the distribute parallel loop SIMD constructs are used to execute a loop in a target teams region. The target teams construct creates a league of teams where the master thread of each team executes the teams region. The distribute parallel loop SIMD construct schedules the loop iterations across the master thread of each team and then across the threads of each team where each thread uses SIMD parallelism.

#pragma omp target teams map(to: v1[0:N], v2[:N]) map(from: p[0:N])
#pragma omp distribute parallel for simd
for (i=0; i<N; i++)
    p[i] = v1[i] * v2[i];

More examples will be added in the next version of this guide.

6.3.8. Runtime routines and environment variables

The following runtime support routines deal with devices:

void omp_set_default_device(int dev_num)   Controls the default target device.
int omp_get_default_device(void)           Returns the number of the default target device.
int omp_get_num_devices(void)              Returns the number of target devices.
int omp_get_num_teams(void)                Returns the number of teams in the current teams region, or 1 if called from outside a teams region.
int omp_get_team_num(void)                 Returns the team number of the calling thread (0 ... omp_get_num_teams()-1).
int omp_is_initial_device(void)            Returns true if the current task is executing on the host device; otherwise, it returns false.
int omp_get_initial_device(void)           Returns a device number representing the host device.

Seven different memory routines not mentioned in the table were added in OpenMP 4.5 to allow explicit allocation, deallocation, memory transfers and memory associations.

The default device can also be set through the environment variable OMP_DEFAULT_DEVICE. The value of this variable must be a non-negative integer.
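
A small sketch using the routines above to check whether a target region actually ran on a coprocessor:

#include <omp.h>
#include <stdio.h>

int main() {
    int on_host = 1;

    printf("number of target devices: %d\n", omp_get_num_devices());

    /* If no device is available, the region runs on the host and
       omp_is_initial_device() returns true there. */
    #pragma omp target map(from: on_host)
    on_host = omp_is_initial_device();

    printf("target region ran on the %s\n", on_host ? "host" : "target device");
    return 0;
}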

6.3.9. Best Practices

Many of the following best practice recommendations aiming for performance portability are cited from [49] and [51]:

1. Keep data resident on the device, since copying data between the host and the device is a bottleneck.

2. Overlap data transfers with computations on the host or the device using asynchronous data transfers.

3. Prefer to include the most extensive combined construct relevant for the loop nest, i.e. #pragma omp target teams distribute parallel for simd (however not available in GCC 6.1).

4. Always include parallel for, and teams and distribute, even if the compiler does not require them.

5. Include the simd directive above the loop you require to be vectorised.

6. Neither collapse nor schedule should harm functional portability, but they might inhibit performance portability.

7. Avoid setting num_teams and thread_limit, since each compiler uses different schemes for scheduling teams to a device.

7. MPI

Details about using the Intel MPI library on Xeon Phi coprocessor systems can be found in [24].

7.1. Setting up the MPI environment

The following commands have to be executed to set up the MPI environment:


# copy MPI libraries and binaries to the card (as root)
# only copying really necessary files saves memory
scp /opt/intel/impi/4.1.0.024/mic/lib/* mic0:/lib
scp /opt/intel/impi/4.1.0.024/mic/bin/* mic0:/bin

# setup Intel compiler variables
. /opt/intel/composerxe/bin/compilervars.sh intel64

# setup Intel MPI variables
. /opt/intel/impi/4.1.0.024/bin64/mpivars.sh

The following network fabrics are available for the Intel Xeon Phi coprocessor:

Fabric name   Description
shm           Shared memory
tcp           TCP/IP-capable network fabrics, such as Ethernet and InfiniBand (through IPoIB)
ofa           OFA-capable network fabrics, including InfiniBand (through OFED verbs)
dapl          DAPL-capable network fabrics, such as InfiniBand, iWarp, Dolphin, and XPMEM (through DAPL)

The Intel MPI library automatically tries to use the best available network fabric detected (usually shm for intra-node communication and InfiniBand (dapl, ofa) for inter-node communication).

The default can be changed by setting the I_MPI_FABRICS environment variable to I_MPI_FABRICS=<fabric> or I_MPI_FABRICS=<intra-node fabric>:<inter-node fabric>. The availability is checked in the following order: shm:dapl, shm:ofa, shm:tcp.

7.2. MPI programming models

Intel MPI for the Xeon Phi coprocessors offers various MPI programming models:

Symmetric model The MPI ranks reside on both the host and the coprocessor. Most general MPI case.

Coprocessor-only model All MPI ranks reside only on the coprocessors.

Host-only model          All MPI ranks reside on the host. The coprocessors can be used via offload pragmas. (Using MPI calls inside offloaded code is not supported.)
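
Independent of the model, a small test program that prints where each rank runs is useful for verification; a minimal sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    printf("rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

Compiled once with -mmic and once without, it can be used for all three models; ranks on the coprocessor report the mic hostname.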

7.2.1. Coprocessor-only model

To build and run an application in coprocessor-only mode, the following commands have to be executed:

# compile the program for the coprocessor (-mmic)
mpiicc -mmic -o test.MIC test.c

# copy the executable to the coprocessor
scp test.MIC mic0:/tmp


# set the I_MPI_MIC variable
export I_MPI_MIC=1

# launch MPI jobs on the coprocessor mic0 from the host
# (alternatively one can log in to the coprocessor and run mpirun there)
mpirun -host mic0 -n 2 /tmp/test.MIC

7.2.2. Symmetric model

To build and run an application in symmetric mode, the following commands have to be executed:

# compile the program for the coprocessor (-mmic)
mpiicc -mmic -o test.MIC test.c

# compile the program for the host
mpiicc -o test test.c

# copy the executable to the coprocessor
scp test.MIC mic0:/tmp/test.MIC

# set the I_MPI_MIC variable
export I_MPI_MIC=1

# launch MPI jobs on the host knf1 and on the coprocessor mic0
mpirun -host knf1 -n 1 ./test : -n 1 -host mic0 /tmp/test.MIC

7.2.3. Host-only model

To build and run an application in host-only mode, the following commands have to be executed:

# compile the program for the host,
# mind that offloading is enabled per default
mpiicc -o test test.c

# launch MPI jobs on the host knf1, the MPI process will offload code
# for acceleration
mpirun -host knf1 -n 1 ./test

7.3. Simplifying launching of MPI jobs

Instead of specifying the hosts and coprocessors via -host hostname, one can also put the names into a hostfile and launch the jobs via

mpirun -f hostfile -n 4 ./test

Mind that the executable must have the same name on the hosts and the coprocessors in this case.

If one sets

export I_MPI_POSTFIX=.mic

the .mic postfix is automatically added to the executable name by mpirun, so in the case of the example above test is launched on the host and test.mic on the coprocessors. It is also possible to specify a prefix using

export I_MPI_PREFIX=./MIC/

In this case ./MIC/test will be launched on the coprocessor. This is especially useful if the host and the coprocessors share the same NFS filesystem.

8. Intel MKL (Math Kernel Library)

8.1. Introduction

Intel provides support for automatic usage of the Intel Xeon Phi co-processor by means of automatic offloading of work using MKL routines. This is a very simple way to exploit the MIC co-processor. No recompilation is necessary. Presently, only a limited set of MKL functions and routines have been offload enabled.

Usage is very simple, as shown in the following Fortran 90 example:

time_start = mysecond()
call dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
time_end = mysecond()
write(*,fmt=form) &
    "dgemm end, timing :", time_end-time_start, " secs, ", &
    ops*1.0e-9/(time_end-time_start), " Gflops/s"

This Fortran 90 code is all it takes to compute A*B => C; all the magic is done by MKL behind the scenes. Compiling is equally simple:

ifort -o dgemm-test.x -mcmodel=medium -O3 -openmp -mkl dgemm-test.f90

The Intel Xeon Phi coprocessor is supported since MKL 11.0. Details on using MKL with Intel Xeon Phi coprocessors can be found in the documentation. Also the MKL developer zone [11] contains useful information. All functions can be used on the Xeon Phi, however the optimization level for the wider 512-bit SIMD instructions differs.

As of Intel MKL 11.0 Update 2 the following functions are highly optimized for the Intel Xeon Phi coprocessor:

• BLAS Level 3, and much of Level 1 & 2

• Sparse BLAS: ?CSRMV, ?CSRMM

• Some important LAPACK routines (LU, QR, Cholesky)

• Fast Fourier Transformations

• Vector Math Library

• Random number generators in the Vector Statistical Library

Intel plans to optimize a wider range of functions in future MKL releases.

8.2. MKL usage modes

The following 3 usage models of MKL are available for the Xeon Phi:

1. Automatic Offload


2. Compiler Assisted Offload

3. Native Execution

8.2.1. Automatic Offload (AO)

In the case of automatic offload the user does not have to change the code at all. For automatic offload enabled functions the runtime may automatically download data to the Xeon Phi coprocessor and execute (all or part of) the computations there. The data transfer and the execution management is completely automatic and transparent for the user.

As of Intel MKL 11.0.2 only the following functions are enabled for automatic offload:

• Level-3 BLAS functions

• ?GEMM (for m,n > 2048, k > 256)

• ?TRSM (for M,N > 3072)

• ?TRMM (for M,N > 3072)

• ?SYMM (for M,N > 2048)

• LAPACK functions

• LU (M,N > 8192)

• QR

• Cholesky

In the above list also the matrix sizes for which MKL decides to offload the computation are given in brackets. For most up to date information about enabled MKL functions, see the Intel Parallel Studio documentation.

To enable automatic offload either the function mkl_mic_enable() has to be called within the source code or the environment variable MKL_MIC_ENABLE=1 has to be set. If no Xeon Phi coprocessor is detected the application runs on the host without penalty.
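
A minimal C sketch of enabling automatic offload from within the code; the matrix size 4096 is chosen to lie above the DGEMM offload thresholds listed above:

#include <stdio.h>
#include <mkl.h>

int main() {
    const int n = 4096;
    double *a = (double *) mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *b = (double *) mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *c = (double *) mkl_malloc((size_t)n * n * sizeof(double), 64);

    for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    mkl_mic_enable();   /* equivalent to setting MKL_MIC_ENABLE=1 */

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %f\n", c[0]);

    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}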

To build a program for automatic offload, the same way of building code as on the Xeon host is used:

icc -O3 -mkl file.c -o file

By default, the MKL library decides when to offload and also tries to determine the optimal work division between the host and the targets (MKL can take advantage of multiple coprocessors). In the case of the BLAS routines the user can specify the work division between the host and the coprocessor by calling the routine

mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)

or by setting the environment variable

MKL_MIC_0_WORKDIVISION=0.5

Both examples specify that 50% of the computation is offloaded, and only to the first card (card #0).

8.2.2. Compiler Assisted Offload (CAO)

In this mode of MKL the offloading is explicitly controlled by compiler pragmas or directives. In contrast to the automatic offload mode, all MKL functions can be offloaded in CAO mode.


A big advantage of this mode is that it allows for data persistence on the device.

For Intel compilers it is possible to use AO and CAO in the same program, however the work division must be explicitly set for AO in this case. Otherwise, all MKL AO calls are executed on the host.

MKL functions are offloaded in the same way as any other offloaded function (see section Section 5). An example for offloading MKL's sgemm routine looks as follows:

#pragma offload target(mic) \
    in(transa, transb, N, alpha, beta) \
    in(A:length(N*N)) \
    in(B:length(N*N)) \
    in(C:length(N*N)) \
    out(C:length(N*N) alloc_if(0))
{

sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

}

To build a program for compiler assisted offload, the following command is recommended by Intel:

icc -O3 -openmp -mkl \
    -offload-option,mic,ld,"-L$MKLROOT/lib/mic -Wl,\
    --start-group -lmkl_intel_lp64 -lmkl_intel_thread \
    -lmkl_core -Wl,--end-group" file.c -o file

To avoid using the OS core, it is recommended to use the following environment setting (in case of a 61-core coprocessor):

MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-240:1]

Setting larger pages by the environment setting MIC_USE_2MB_BUFFERS=16K usually increases performance. It is also recommended to exploit data persistence with CAO.

Table 1. Relevant environment flags

Environment variable     Function
MKL_MIC_ENABLE           Enables Automatic Offload.
OFFLOAD_DEVICES          Lists the offload devices.
MKL_MIC_WORKDIVISION     Specifies the fraction of work to do on all the Intel Xeon Phi coprocessors on the system (a fraction between 0 and 1, or auto).
OFFLOAD_REPORT           Specifies the profiling report level for any offload.
MIC_LD_LIBRARY_PATH      Search path for coprocessor-side dynamic libraries.

8.2.3. Native Execution

In this mode of MKL the Intel Xeon Phi coprocessor is used as an independent compute node.

To build a program for native mode, the following compiler settings should be used:

icc -O3 -mkl -mmic file.c -o file

The binary must then be manually copied to the coprocessor via ssh and directly started on the coprocessor.

8.3. Example code

Example code can be found under $MKLROOT/examples/mic_ao and $MKLROOT/examples/mic_offload.

8.4. Intel Math Kernel Library Link Line Advisor

To determine the appropriate link line for MKL, the Intel Math Kernel Library Link Line Advisor available under [12] has been extended to include support for the Intel Xeon Phi specific options.

9. TBB: Intel Threading Building Blocks

The Intel TBB library is a template based runtime library for C++ code using threads that allows us to fully utilize the scaling capabilities within our code by increasing the number of threads and supporting task oriented load balancing.

Intel TBB is open source and available on many different platforms with most operating systems and processors. It is already popular in the C++ community. You should be able to use it with any compiler supporting ISO C++. So this is one of the advantages of Intel TBB when you intend to keep your code as easily portable as possible.

Typically as a rule of thumb an application must scale well past one hundred threads on Intel Xeon processors to profit from the possible higher parallel performance offered with e.g. the Intel Xeon Phi coprocessor. To check if the scaling would profit from utilising the highly parallel capabilities of the MIC architecture, you should start to create a simple performance graph with a varying number of threads (from one up to the number of cores).

From a programming standpoint we treat the coprocessor as a 64-bit SMP-on-a-chip with a high-speed bi-directional ring interconnect, (up to) four hardware threads per core and 512-bit SIMD instructions. With the available number of cores we easily have over 200 hardware threads at hand on a single coprocessor. The multi-threading on each core is primarily used to hide latencies that come implicitly with an in-order microarchitecture. Unlike hyper-threading these hardware threads cannot be switched off and should never be ignored. Generally it should be impossible for a single thread per core to approach the memory or floating point capability limit. Highly tuned code snippets may reach saturation already at two threads, but in general a minimum of three or four active threads per core will be needed. This is one of the reasons why the number of threads per core should be parameterized as well as the number of cores. The other reason is of course to be future compatible.

TBB offers programming methods that support creating this many threads in a program. In the easiest way the one main production loop is transformed by adding a single directive or pragma enabling the code for many threads. The chunk size used is chosen automatically.

The new Intel Cilk Plus which offers support for a simpler set of tasking capabilities fully interoperates with Intel TBB. Apart from that Intel Cilk Plus also supports vectorization. So shared memory programmers have Intel TBB and Intel Cilk Plus to assist them with built-in tasking models. Intel Cilk Plus extends Intel TBB to offer C programmers a solution as well as help with vectorization in C and C++ programs.

Intel TBB itself does not offer any explicit vectorization support. However, it does not interfere with any vectorization solution either.

In relevance to the Intel Xeon Phi coprocessor TBB is just one available runtime-based parallel programming model alongside OpenMP, Intel Cilk Plus and pthreads that are also already available on the host system. Any code running natively on the coprocessor can put them to use just like it would on the host with the only difference being the larger number of threads. 9.1. Advantages of TBB

There exists a variety of approaches to parallel programming, but there are several advantages to using Intel TBB when writing scalable applications:


• TBB relies on generic programming: Writing the best possible algorithms with the fewest possible constraints enables to deliver high performance algorithms which can be applied in a broader context. Other more traditional libraries specify interfaces in terms of particular types or base classes. Intel TBB specifies the requirements on the types instead and in this way keeps the algorithms themselves generic and easily adaptable to different data representations.

• It is easy to start: You don't have to be a threading expert to leverage multi-core performance with the help of TBB. Normally you can successfully thread some programs just by adding a single directive or pragma to the main production loop.

• It obeys to logical parallelism: Since with TBB you specify tasks instead of threads, you automatically produce more portable code which emphasizes scalable, data parallel programming. You are not bound to platform-de- pendent threading primitives; most threading packages require you to directly code on low-level constructs close to the hardware. Direct programming on raw thread level is error-prone, tedious and typically hard work since it forces you to efficiently map logical tasks into threads and it is not always leading to the desired results. With the higher level of data-parallel programming on the other hand, where you have multiple threads working on different parts of a collection, performance continues to increase as you add more cores since for a larger number of processors the collections are just divided into smaller chunks. This is a great feature when it comes to portability.

• TBB is compatible with other programming models: Since the library is not designed to address all kinds of threading problems, it can coexist seamlessly with other threading packages.

• The template-based approach allows Intel TBB to make no excuses for performance. Other general-purpose threading packages tend to be low-level tools that are still far from the actual solution, while at the same time supporting many different kinds of threading. In TBB every template solves a computationally intensive prob- lem in a generic, simple way instead.

All of these advantages make TBB popular and easily portable while at the same time facilitating data parallel programming.

Further advanced concepts in TBB that are not MIC specific can be found in the Intel TBB User Guide or in the Reference Manual, both available under [17].

9.2. Using TBB natively

Code that runs natively on the Intel Xeon Phi coprocessor can apply the TBB parallel programming model just as it would on the host, with no unusual complications beyond the larger number of threads.

In order to initialize the compiler environment variables needed to set up TBB correctly, typically the /opt/intel/composerxe/tbb/bin/tbbvars.csh or tbbvars.sh script with intel64 as argument is called by the /opt/intel/composerxe/bin/compilervars.csh or compilervars.sh script with intel64 as argument (e.g. source /opt/intel/composerxe/bin/compilervars.sh intel64).

Normally there is no need to call the tbbvars script directly, and it is not advisable either, since the compilervars script also calls other subscripts taking care of, e.g., the debugger or Intel MKL, and running the subscripts out of order might result in unpredictable behavior.

A minimal C++ TBB example looks as follows:

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

using namespace tbb;

int main() {
    task_scheduler_init init;
    return 0;
}

The using directive imports the namespace tbb where all of the library's classes and functions are found. The namespace is explicit in the first mention of a component, but implicit afterwards. So with the using namespace statement present you can use the library component identifiers without having to write out the namespace prefix tbb before each of them.

The task scheduler is initialized by instantiating a task_scheduler_init object in the main function. The definition of the task_scheduler_init class is included from the corresponding header file. Actually any thread using one of the provided TBB template algorithms must have such an initialized task_scheduler_init object. The default constructor of the task_scheduler_init object informs the task scheduler that the thread is participating in task execution, and the destructor informs the scheduler that the thread no longer needs the scheduler. With the newer versions of Intel TBB as used in a MIC environment the task scheduler is initialized automatically, so there is no need to initialize it explicitly unless you need control over when the task scheduler is constructed or destroyed. When initializing it explicitly you also have the further possibility to tell the task scheduler how many worker threads to use and what their stack size should be.

In the simplest form scalable parallelism can be achieved by parallelizing a loop of iterations that can each run independently from each other.

The parallel_for template function replaces a serial loop where it is safe to process each element concurrently.

A typical example would be to apply a function Foo to all elements of an array over the iteration space of type size_t going from 0 to n-1:

void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i!=n; ++i )
        Foo(a[i]);
}

becomes

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for(size_t(0), n, [=](size_t i) {Foo(a[i]);});
}

This is the TBB short form of a parallel_for over a loop based on a one-dimensional iteration space consisting of a consecutive range of integers (which is one of the most common cases). The expression parallel_for(first,last,step,f) is synonymous with for(auto i=first; i!=last; i+=step) f(i), except that each f(i) can be evaluated in parallel if resources permit. The step parameter is optional and may be omitted. The short form implicitly uses automatic chunking.

The long form would be:

void ParallelApplyFoo( float* a, size_t n ) {
    parallel_for( blocked_range<size_t>(0,n),
        [=](const blocked_range<size_t>& r) {
            for( size_t i=r.begin(); i!=r.end(); ++i )
                Foo(a[i]);
        });
}

Here the key feature of the TBB library is revealed more clearly. The template function tbb::parallel_for breaks the iteration space into chunks, and runs each chunk on a separate thread. The first parameter of the template function call parallel_for is a blocked_range<size_t> object that describes the entire iteration space from 0 to n-1. The parallel_for divides the iteration space into subspaces for each of the over 200 hardware threads. blocked_range<T> is a template class provided by the TBB library describing a one-dimensional iteration space over type T. The parallel_for class works just as well with other kinds of iteration spaces; the library provides blocked_range2d for two-dimensional spaces, and there is also the possibility to define your own spaces. The general constructor of the blocked_range template class is blocked_range<T>(begin,end,grainsize). T specifies the value type. begin represents the lower bound of the half-open range interval [begin,end) representing the iteration space, end represents the excluded upper bound of this range, and grainsize is the approximate number of elements per sub-range. The default grainsize is 1.

A parallel loop construct introduces overhead cost for every chunk of work that it schedules. The MIC adapted Intel TBB library chooses chunk sizes automatically, depending upon load balancing needs. The heuristic nor- mally works well with the default grainsize. It attempts to limit overhead cost while still providing ample opportunities for load balancing. For most use cases automatic chunking is the recommended choice. There might be situations though where controlling the chunk size more precisely might yield better performance.
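
If explicit control is wanted, the grainsize can be fixed and automatic chunking switched off with the simple_partitioner; a sketch (Foo as in the examples above, headers as included earlier):

void Foo(float v);   /* as above */

void ParallelApplyFooChunked( float a[], size_t n )
{
    /* Chunks of at most 1000 iterations; simple_partitioner disables
       the automatic chunk size heuristic. */
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, n, 1000),
        [=](const tbb::blocked_range<size_t>& r) {
            for( size_t i = r.begin(); i != r.end(); ++i )
                Foo(a[i]);
        },
        tbb::simple_partitioner());
}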

When compiling programs that employ TBB constructs, be sure to link in the Intel TBB shared library with -ltbb; otherwise undefined references will occur.

icc -mmic -ltbb foo.cpp

Afterwards you can use scp to upload the binary and any shared libraries required by your application to the coprocessor. On the coprocessor you can then export the library path and run the application.

9.3. Offloading TBB

The Intel TBB header files are not available on the Intel MIC target environment by default (the same is also true for Intel Cilk Plus). To make them available on the coprocessor the header files have to be wrapped with #pragma offload directives as demonstrated in the example below:

#pragma offload_attribute (push,target(mic))
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#pragma offload_attribute (pop)

Functions called from within the offloaded construct and global data required on the Intel Xeon Phi coprocessor should be appended by the special function attribute __attribute__((target(mic))).
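
A hedged sketch of an offloaded TBB loop; the function scale() and the array handling are made up for illustration, and the program is compiled with the -tbb flag as noted below:

#pragma offload_attribute (push,target(mic))
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#pragma offload_attribute (pop)

__attribute__((target(mic)))
void scale(float *a, size_t n)
{
    tbb::parallel_for(size_t(0), n, [=](size_t i) { a[i] *= 2.0f; });
}

int main()
{
    const size_t n = 1000000;
    float *a = new float[n];
    for (size_t i = 0; i < n; ++i) a[i] = 1.0f;

    /* Transfer the array, run the TBB loop on the coprocessor,
       and copy the result back. */
    #pragma offload target(mic) inout(a : length(n))
    scale(a, n);

    delete[] a;
    return 0;
}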

Codes using Intel TBB with an offload should be compiled with the -tbb flag instead of -ltbb.

9.4. Examples

The TBB library provides support for parallel algorithms, concurrent containers, tasks, synchronization, memory allocation and thread control. Among them, the templates cover well-known parallel programming patterns like parallel loops, reduction operations, task-based constructs and pipelines. An easy and flexible formulation of parallel computations is possible. The programmer's task is mainly to expose a high degree of parallelism to the processor, to allow or support the vectorization of calculations, and to maximize the cache-locality of algorithms. Vectorization and striving for implementations with high cache-locality are general topics of performance optimization.

9.4.1. Exposing parallelism to the processor

The concrete parallelization of a certain algorithm depends of course on the specific floating-point operations used, its memory access pattern, the possible decoupling into parallel operations resp. the dependencies

between the calculations. However, some estimate can be made of the degree of parallelism needed to achieve efficient execution on the coprocessor.

Xeon Phi processors can execute 4 hyperthreads on each core. That results in 240 threads active at the same time on the 60 cores of the coprocessor. A rule of thumb is that one should have about 10 computational tasks per thread in order to compensate for delays from load imbalances, which results in 2400 parallel tasks. Efficient calculations also have to use the vector units, which can process 16 single-precision floating-point numbers in one instruction. Therefore, several tens of thousands of operations that can be executed independently from each other are needed in order to keep the processor busy.

On the other hand, good load balancing or the placement of several MPI tasks on the Xeon Phi can lower the needed degree of parallelism. One should also consider whether it really increases performance to use all 4 hyperthreads of a core: the hyperthreads share certain hardware resources, and depending on the algorithm, competition for these resources can occur, reducing the overall execution speed.

9.4.1.1. Parallelisation with the task construct

This example shows how one can implement a multi-threaded, task-oriented algorithm. Nevertheless, there is no need to deal with the subtleties of thread creation and their control for the implementer of the algorithm.

The class tbb::task represents a low-level task with low overhead. It is used by higher-level constructs like parallel_for or task_group.

Our example shows the use of task groups in a walkthrough through a binary tree. The sum of values stored in each node will be calculated during this walkthrough.

The tree nodes are defined as follows:

struct tree {
    struct tree *l, *r;   // left and right subtree
    long depth;           // depth of the tree starting from here
    float v;              // an important value
};

The serial implementation could be written as:

float tree_sum( struct tree *t ) {
    float mysum = t->v;

    if ( t->l != NULL ) mysum += tree_sum( t->l );
    if ( t->r != NULL ) mysum += tree_sum( t->r );

    return mysum;
}

A TBB task that performs the same computation is given by the following class ParallelSumTask. It must implement the method execute(). This method either computes the value sum serially, if its subtree is less than 10 levels deep, or creates two subtasks that calculate the value sums of the left resp. right child tree. These tasks are added to the task list of the active scheduler. After their termination the current task adds the value sums of both child trees and its own value and terminates itself.

class ParallelSumTask : public tbb::task {
    float *mysum;
    struct tree *myr;
public:
    ParallelSumTask( struct tree *r, float *sum ) : myr(r), mysum(sum) {}

    task *execute() {
        if( myr->depth < 10 ) {
            *mysum = tree_sum(myr);
        } else {
            float sum_l, sum_r;
            tbb::task_list tlist;

            // Create subtasks for processing of left and right subtree.
            int count = 1;
            if( myr->l ) {
                ++count;
                tlist.push_back( *new(allocate_child())
                                 ParallelSumTask(myr->l, &sum_l));
            }
            if( myr->r ) {
                ++count;
                tlist.push_back( *new(allocate_child())
                                 ParallelSumTask(myr->r, &sum_r));
            }
            // Start task processing.
            set_ref_count(count);
            spawn_and_wait_for_all(tlist);

            // Add own value and sums of left and right subtree.
            *mysum = myr->v;
            if( myr->l ) *mysum += sum_l;
            if( myr->r ) *mysum += sum_r;
        }
        return NULL;
    }
};

The calculation of the value sum could be started in the following way:

float value_sum;
struct tree *tree_root;

// ... Initialize tree ...

// Initialize a task scheduler with the maximum number of threads.
tbb::task_scheduler_init init( 240 );
// Perform the calculation.
ParallelSumTask *a = new(tbb::task::allocate_root())
                     ParallelSumTask(tree_root, &value_sum);
tbb::task::spawn_root_and_wait(*a);
// The sum of all values stored in the tree nodes
// has been calculated in variable "value_sum".

9.4.1.2. Loop parallelisation

TBB provides several algorithms that can be used for the parallelisation of loops, again without forcing the implementer to deal with the details of low-level threads. The following example shows how a loop can be parallelized with the parallel_for template. The implemented algorithm shall calculate the sum of the vector elements.


The serial implementation could be written as follows:

float serial_vecsum( float *vec, int len ) {
    float sum = 0;

    for (int i = 0; i < len; ++i)
        sum += vec[i];
    return sum;
}

The parallel algorithm shall compute the partial sums of partial vectors. The overall sum of all partial vectors will be computed as sum of these partial sums.

The TBB template parallel_reduce provides the basic implementation of subdividing a range to iterate over into several subranges. The subranges are assigned to different threads that perform the iterations on them. The program has to provide a "loop body" that will be applied to each array element as well as a function that processes the results of the iterations over the subranges of the vector.

This functionality for the vector summation is provided in the class adder. The transformation function working on the vector elements has to be provided as the operator() method. The function that reduces the results of the parallel work on different subvectors has to be implemented as the method join(), which takes a reference to another instance of the adder class as argument. Finally, we have to equip the adder class with a splitting constructor, because each thread will get its own copy of the originally provided adder instance.

class adder {
    float *myvec;   // pointer to array with values to sum
public:
    float sum;

    adder( float *a ) : myvec(a), sum(0) { }

    adder( adder &a, tbb::split ) : myvec(a.myvec), sum(0) { }

    void operator()( const tbb::blocked_range<int> &r ) {
        for( int i = r.begin(); i != r.end(); ++i )
            sum += myvec[i];
    }

    void join( const adder &other ) { sum += other.sum; }
};

The listing shows how the partial sum is calculated by iterating over the elements of the subvector in operator(). The reduction of two partial sums to one value is implemented in join().

The summation of all array elements can be performed with the following code piece:

float *vec;     // Vector data, initialise them.
int veclen;
// ...
adder vec_adder(vec);   // Create root object for decomposition.
// Define range and grainsize for the iteration.
parallel_reduce(tbb::blocked_range<int>(0, veclen, 5), vec_adder);
// vec_adder.sum contains the sum of all vector elements.

9.4.2. Vectorization and Cache-Locality

Vectorization and cache-locality should be used in order to increase the program efficiency. Many details and examples are given in the programming and compilation guide of the MIC architecture [28].

Vectorization can be achieved by auto-vectorization and checked with the vectorization report. There is the possibility to apply user-mandated vectorization for constructs that are not covered by automatic vectorization. Users of Cilk Plus also have the option to use its extensions for array notation.

9.4.3. Work-stealing versus Work-sharing

Using work-stealing for task scheduling means that idle threads will take over tasks from other busy threads. This approach has already been mentioned above, where it was said that a certain number of tasks per thread should be available in order to compensate for load imbalances. The scheduler can keep idle processors busy if sufficiently many tasks are available.

Work-sharing as a method of task scheduling is worthwhile for well-balanced workloads. Work-sharing is typically implemented as a task pool and achieves near-optimal occupancy for balanced workloads.

10. IPP: The Intel Integrated Performance Primitives

Remark. This section gives a general overview of IPP based on version 8.0 (which did not yet support the Intel MIC architecture). Meanwhile, the Intel® IPP libraries contain, along with the IA-32 and Intel® 64 versions, native Intel® MIC architecture libraries, which allow building applications that run entirely (natively) or partially (offload) on the Intel® MIC architecture. Examples are available in the file linux/ipp/examples/ipp_examples_mic.tgz (components_and_examples_mic.tgz) delivered with the Intel Parallel Studio 2016 (2017) compiler suite. This tgz file also contains useful documentation for using IPP on Xeon Phi (documentation/ipp-examples.htm).

10.1. Overview of IPP

Intel Integrated Performance Primitives (IPP) is a software library which provides a large collection of functions for signal processing and multimedia processing. It is usable on MS Windows, Linux, and Mac OS X platforms.

The covered topics are

• Signal processing, including

• FFT, FIR, Transformations, Convolutions etc

• Image & frame processing, including

• Transformations, filters, color manipulation

• General functionality, including

• Statistical functions

• Vector, matrix and linear algebra operations for small matrices.

The functions in this library are optimised to use advanced features of the processors like SSE and AVX instruction sets. Many functions are internally parallelised using OpenMP.

Practical aspects. IPP requires support of Streaming SIMD Extensions (SSE) or Advanced Vector Extensions (AVX). The validated operating systems include newer versions of MS Windows (Windows 2003, Windows 7, Windows 2008), Red Hat Enterprise Linux (RHEL) 5 and 6, and Mac OS X 10.6 or higher. More detailed information can be found in the product documentation.


The library is compatible with Intel, Microsoft, GNU and cctools compilers. It contains freely distributable runtime libraries in order to allow the execution of programs for users without the need to install the development tools.

10.2. Using IPP

10.2.1. Getting Started

The following example has been taken from the IPP User's Guide.

#include "ipp.h"
#include <stdio.h>

int main(int argc, char* argv[])
{
    const IppLibraryVersion *lib;
    Ipp64u fm;

    ippInit();                    // Automatic best static dispatch
    //ippInitCpu(ippCpuSSSE3);    // Select a specific static library level
                                  // Can be useful for debugging/profiling

    // Get version info
    lib = ippiGetLibVersion();
    printf("%s %s\n", lib->Name, lib->Version);

    // Get CPU features enabled with selected library level
    fm = ippGetEnabledCpuFeatures();
    printf("SSE2  %c\n", (fm>>2)&1  ? 'Y' : 'N');
    printf("SSE3  %c\n", (fm>>3)&1  ? 'Y' : 'N');
    printf("SSSE3 %c\n", (fm>>4)&1  ? 'Y' : 'N');
    printf("SSE41 %c\n", (fm>>6)&1  ? 'Y' : 'N');
    printf("SSE42 %c\n", (fm>>7)&1  ? 'Y' : 'N');
    printf("AVX   %c OS Enabled %c\n", (fm>>8)&1  ? 'Y' : 'N', (fm>>9)&1  ? 'Y' : 'N');
    printf("AES   %c CLMUL %c\n",      (fm>>10)&1 ? 'Y' : 'N', (fm>>11)&1 ? 'Y' : 'N');

    return 0;
}

This program contains three steps. The initialisation with ippInit() makes sure that the best possible implementation of the IPP functions will be used during the program execution. The next step provides information about the used library version, and finally the enabled hardware features are listed.

Building of the program. The first step to build the program is to provide the correct environment settings for the compiler. Intel's default solution is a shellscript compilervars.sh in the bin directory of the installation that should be executed. Some other software like the modules environment is available on some HPC systems too. Please refer to the locally available documentation.

The program, saved in the file ipptest.cpp, can be compiled and linked with the following command line:

icc ipptest.cpp -o ipptest \
    -I$IPPROOT/include -L$IPPROOT/lib/intel64 \
    -lippi -lipps -lippcore

The executable can be started with the following command:


./ipptest

10.2.2. Linking of Applications

IPP contains different implementations of each function that provide the best performance on different processor architectures. They will be selected by means of dispatching while the programmer uses always the same API.

Dynamic Linking. Dynamic linking of IPP can be achieved by using Intel's compiler switch -ipp or by linking with the default libraries, whose names do not end in _l resp. _t. The dynamic libraries internally use OpenMP in order to achieve multi-threaded execution. The multithreading within the IPP routines can be turned off by calling the function ippSetNumThreads(1). Other options are static single-threaded linking or building a single-threaded shared library from the static single-threaded libraries.

Static Linking. The libraries with names ending in _l must be used for static single-threaded linking, while the libraries with names ending in _t provide multi-threaded implementations based again on OpenMP.

More information. The exact choice of the linking model depends on several factors like the intended processor architectures for the execution, the usable memory size during the execution, installation aspects and more. A white paper available online focuses on these aspects and provides an in-depth discussion of the topic.

10.3. Multithreading

The use of threads within IPP should also be taken into consideration by the application developer.

• IPP uses OpenMP in order to achieve internally a multi-threaded execution as mentioned in the sec- tion about linking. IPP uses processors up to the minimum of $OMP_NUM_THREADS and the number of available processors. Another possibility to get and set the number of used processors are the functions ippSetNumThreads(int n) and ippGetNumThreads().

• Some functions (for example FFT) are designed to use two threads that should be mapped onto the same die in order to use a shared L2 cache if available. The user should set the following environment variable when using processors with more than two cores per die in order to ensure best performance: KMP_AFFINITY=compact.

• There is a risk of thread oversubscription and performance degradation if an application uses OpenMP in addition. The use of threads within IPP can be turned off by calling ippSetNumThreads(1) (see the sketch below). However, some OpenMP-related functionality could remain active regardless. Therefore, single-threaded execution is best achieved by using the single-threaded libraries.
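
A small sketch (assuming the threaded IPP 8.x libraries) of querying and restricting the number of threads IPP uses internally:

#include <stdio.h>
#include "ipp.h"

int main()
{
    int nthr = 0;

    ippInit();                   // select the best implementation

    ippGetNumThreads(&nthr);     // threads IPP would use by default
    printf("IPP default number of threads: %d\n", nthr);

    ippSetNumThreads(1);         // force single-threaded execution inside IPP
    ippGetNumThreads(&nthr);
    printf("IPP now uses %d thread(s)\n", nthr);

    return 0;
}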

11. Further programming models

The programming models OpenCL and OpenACC have become popular for programming GPGPUs and have also been enabled for the Intel MIC architecture.

OpenCL (Open Computing Language) [32] is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL is maintained by the non-profit technology consortium Khronos Group. It has been adopted by Apple, Intel, Qualcomm, Advanced Micro Devices (AMD), Altera, Samsung, Vivante and ARM Holdings. OpenCL 2.0 is the latest significant evolution of the OpenCL standard, designed to further simplify cross-platform programming, while enabling a rich range of algorithms and programming patterns to be easily accelerated.

OpenCL views a computing system as consisting of a number of compute devices. The compute devices can be either CPUs or accelerators, such as graphics processing units (GPUs), or the many-core Xeon Phi, attached to a CPU. OpenCL defines a C-like language for writing programs, and the functions that are executed on an OpenCL device are called "kernels". Besides the C-like programming language, OpenCL defines an application

programming interface (API) that allows host programs to launch compute kernels on the OpenCL devices and manage device memory. Programs in the OpenCL language are intended to be compiled at run time (although this is not strictly required), so that OpenCL-using applications are portable between implementations for various host devices.

The main competitor of OpenCL is NVIDIA's CUDA. However, CUDA can run only on NVIDIA devices, so OpenCL's main strength is its versatility. Another major advantage of OpenCL over CUDA is that it can harness all resources in a computing system for a given task, hence enabling heterogeneous computing. In the case of a Xeon+Xeon Phi system, OpenCL can make use of both the Xeon host as well as the Xeon Phi accelerator. The Intel Xeon Phi coprocessor supports OpenCL 1.2.

While OpenCL is a portable programming model, performance portability is not guaranteed. Traditional GPUs and the Intel Xeon Phi coprocessor have different hardware designs, and their differences are such that they benefit from different application optimizations. For example, traditional GPUs rely on the existence of fast shared local memory, which the programmer needs to program explicitly. The Intel Xeon Phi coprocessor includes a fully coherent cache hierarchy, similar to regular CPU caches, which automatically speeds up memory accesses. Another example: while some traditional GPUs are based on hardware scheduling of many tiny threads, Intel Xeon Phi coprocessors rely on the device OS to schedule medium-sized threads. These and other differences suggest that applications usually benefit from tuning to the hardware they are intended to run on.

In the following example we show a naive implementation of matrix-matrix multiplication on the Xeon Phi using OpenCL. The following kernel uses only global memory, and hence is not performance-optimized.

__kernel void matMatMultKern(__global double *output, __global double *d_MatA,
                             __global double *d_MatB, __global int *d_rows,
                             __global int *d_cols)
{
    int globalIdx = get_global_id(0);
    int globalIdy = get_global_id(1);
    double sum = 0.0;
    int i;
    double tmpA, tmpB;
    for ( i = 0; i < (*d_rows); i++) {
        tmpA = d_MatA[globalIdy * (*d_rows) + i];
        tmpB = d_MatB[i * (*d_cols) + globalIdx];
        sum += tmpA * tmpB;
    }
    output[globalIdy * (*d_rows) + globalIdx] = sum;
}

Tools for OpenCL on the Xeon Phi

To run OpenCL applications on the Phi, two additional toolsets are required. The first one, necessary to actually execute OpenCL applications, is the Intel SDK for OpenCL. To compile OpenCL applications for Intel platforms, the Intel Parallel Studio XE is also required. It is possible to compile OpenCL applications simply with g++, but the OpenCL headers are only available as part of the Intel Parallel Studio XE.

To optimize OpenCL kernels, the following tools are recommended:

• Intel VTune Amplifier XE 2013, which enables you to fine-tune for optimal performance, ensuring the CPU or coprocessor device facilities are fully utilized.

• OpenCL Code Builder - Offline Compiler command-line tool, which offers full offline OpenCL language compilation, including an OpenCL syntax checker, cross-hardware compilation support, a Low-Level Virtual Machine (LLVM) viewer, an assembly language viewer, and an intermediate program binaries generator.

• OpenCL Code Builder - Kernel Builder, which enables you to build and analyze your OpenCL kernels. The tool provides full offline OpenCL language compilation.


• OpenCL Code Builder - Debugger, which enables you to debug OpenCL kernels with the GNU Project Debugger (GDB).

• OpenCL Code Builder - Offline Compiler and Debugger plug-ins for Microsoft Visual Studio IDE.

A coding guide for developing OpenCL applications for the Intel Xeon Phi coprocessor can be found in [33].

11.2. OpenACC

The OpenACC Application Program Interface [34] describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators. OpenACC is designed for portability across a wide range of accelerators and coprocessors, including APUs, GPUs, many-core and multi-core implementations. The standard was originally developed by Cray, CAPS, Nvidia and PGI. Currently OpenACC has many member and supporter organizations, see http://www.openacc.org/About_OpenACC.

Currently there is no OpenACC support for Intel Xeon Phi coprocessors. Intel Xeon Phi support was provided by the French company CAPS, which does not exist anymore. Based on the directive-based OpenACC and OpenHMPP standards, the CAPS compilers enabled developers to incrementally build portable applications for various many-core systems such as NVIDIA and AMD GPUs, and Intel Xeon Phi. Currently, there is no OpenACC implementation available for the Xeon Phi. However, there are plans that the PGI compiler suite will support Knights Landing in the future.

12. Debugging

Information about debugging on Intel Xeon Phi coprocessors can be found in [25].

The GNU debugger (gdb) has been enabled by Intel to support the Intel Xeon Phi coprocessor. The debugger is now part of the recent MPSS release and does not have to be downloaded separately any more.

There are two different modes of debugging supported: native debugging on the coprocessor or remote cross-debugging from the host.

12.1. Native debugging with gdb

• Run gdb on the coprocessor

ssh -t mic0 /usr/bin/gdb

• One can then attach to a running application with process ID pid via

(gdb) attach pid

• or alternatively start an application from within gdb via

(gdb) file /path/to/application
(gdb) start

12.2. Remote debugging with gdb

• Run the special gdb version with Xeon Phi support on the host

/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb

• Start the gdbserver on the coprocessor by typing on the host gdb

(gdb) target extended-remote| ssh -T mic0 gdbserver -multi -

• Attach to a remotely running application with the remote process ID pid

(gdb) file /local/path/to/application
(gdb) attach pid

• It is also possible to run an application directly from the host gdb

(gdb) file /local/path/to/application
(gdb) set remote exec-file /remote/path/to/application

12.3. Debugging with Totalview

It is possible to debug applications running natively on the Intel Xeon Phi coprocessor. As with the offload directives mode, TotalView launches its debug server on the coprocessor, which will start the remote graphic interface.

12.3.1. Using a Standard Single Server Launch string

This method is useful when the filesystem where TotalView is installed is shared between the host and the Xeon Phi. If the filesystem is not shared with the coprocessor, you can copy the "tvdsvrmain_mic" binary to /tmp on both the host and the coprocessor(s):

cp /path/to/totalview/tvdsvrmain_mic /tmp
scp /path/to/totalview/tvdsvrmain_mic $HOSTNAME-mic0:/tmp

and then modify the launch string (File, Preferences, Launch Strings, Command) from the default to:

%C %R -n "%B/tvdsvr%K -working_directory %D -callback %L -set_pw %P %F"

Afterwards, you can open Totalview as:

totalview -r $HOSTNAME-mic0 /path/to/binary.mic

And proceed with the debugging of "binary.mic" from Totalview, the same way as when running locally (Go, Halt, Step, Breakpoint, etc.).

12.3.2. Using the Xeon Phi Native Launch string

This method is useful in two primary use cases:

• To perform debugging on both the host and the coprocessor while maintaining separate server launch strings.

• To run TotalView in an environment where TotalView servers are not installed on an accessible, shared file system.

Open TotalView and your remote debugging session using this launch string:

totalview -mmic -r $HOSTNAME-mic0 binary.mic

And proceed with the debugging of "binary.mic" from Totalview, the same way as when running locally (Go, Halt, Step, Breakpoint, etc.).

13. Tuning

Information and material on performance tuning from Intel can be found in [35] and [36].

A single Xeon Phi core is slower than a Xeon core due to its lower clock frequency, smaller caches and lack of sophisticated features such as out-of-order execution and branch prediction. To fully exploit the processing power of a Xeon Phi, parallelism on both the instruction level (SIMD) and the thread level (OpenMP) is needed.

The following sections describe a few basic methodologies for improving the performance of a code on a Xeon Phi. As a first step, we consider methods for improving the performance on a single core. We then continue with thread-level parallelization in shared memory.

13.1. Single core optimization

13.1.1. Memory alignment

The Xeon Phi can only perform memory reads/writes on 64-byte aligned data. Any unaligned data will be fetched and stored by performing a masked unpack or pack operation on the first and last unaligned bytes. This may cause performance degradation, especially if the data to be operated on is small in size and mostly unaligned. In the following, we list a few of the most common ways to let the compiler align the data or assume that the data has been aligned.

Compiler flags for alignment

C/C++ n/a

Fortran Align arrays: -align array64byte

Align fields of derived types: -align rec64byte

Compiler directives for alignment

C/C++ Align variable var:

float var[100] __attribute__((aligned(64)));

Inform the compiler of the alignment of variable var:

__assume_aligned(var, 64)

Declare a loop to be aligned:

#pragma vector aligned

Fortran Align variable var:

real var(100)

!dir$ attributes align:64::var

Inform the compiler of the alignment of variable var:

!dir$ assume_aligned var:64

Declare a loop to operate on aligned data:

!dir$ vector aligned

Allocation of aligned dynamic memory

C/C++ Use _mm_malloc() and _mm_free() to allocate and free memory. These take the desired byte-alignment as a second input argument. When using an aligned variable var, use __assume_aligned(var, 64) to inform the compiler about the alignment.

Fortran Use the -align array64byte compiler flag to enforce aligned heap memory allocation.
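
As a small illustration, a minimal C sketch (assuming the Intel compiler and the intrinsics header; variable names are illustrative) that combines the aligned allocation and the alignment assertion described above:

#include <stdio.h>
#include <immintrin.h>   /* _mm_malloc / _mm_free */

int main(void)
{
    int n = 1024;
    /* Request 64-byte aligned storage, as needed for full-width vector loads/stores. */
    double *a = _mm_malloc(n * sizeof(double), 64);
    if (a == NULL) return 1;

    __assume_aligned(a, 64);   /* Intel compiler: assert that the pointer is 64-byte aligned */
    for (int i = 0; i < n; i++)
        a[i] = (double)i;

    printf("a[10] = %f\n", a[10]);
    _mm_free(a);
    return 0;
}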

For a more detailed description of memory alignment on the Xeon Phi, see [40].

13.1.2. SIMD optimization

Each Xeon Phi core has a 512-bit VPU unit which is capable of performing 16 SP or 8 DP flops per clock cycle. The VPU units are also capable of Fused Multiply-Add (FMA) or Fused Multiply-Subtract (FMS) operations, which effectively double the theoretical floating point performance.
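
As a rough, illustrative back-of-the-envelope estimate (the 61 cores and 1.2 GHz clock below are assumed example values, not the specification of a particular card), the theoretical double-precision peak follows directly from these figures:

\mathrm{Peak}_{\mathrm{DP}} \approx N_{\mathrm{cores}} \times f_{\mathrm{clock}} \times 8\,\tfrac{\mathrm{DP}}{\mathrm{cycle}} \times 2\,(\mathrm{FMA})
 \approx 61 \times 1.2\,\mathrm{GHz} \times 16 \approx 1.17\,\mathrm{Tflop/s}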

Intel compilers have several directives to aid vectorization of loops. These are listed in the following in short. For details, refer to the compiler manuals.

Let the compiler know there are no loop-carried dependencies; this only affects the compiler heuristics.

C/C++ #pragma IVDEP

Fortran !DIR$ IVDEP

Let the compiler know the loop should be vectorized; this only affects the compiler heuristics.

C/C++ #pragma VECTOR

Fortran !DIR$ VECTOR

Force the compiler to vectorize a loop, independent of heuristics.

C/C++ #pragma SIMD or #pragma vector always

Fortran !DIR$ SIMD or !DIR$ VECTOR ALWAYS

The modern, portable way of doing this is to use the OpenMP 4.0 SIMD directives. OpenMP SIMD directives instruct the compiler to use the SIMD vector unit as the programmer wants. This overrides the default vectorization, where the compiler might back off because it cannot know whether it is safe to vectorize. Intel-specific directives like !DIR$ IVDEP (Ignore Vector Dependencies) are not portable, and it is suggested to use the OpenMP SIMD directives instead. OpenMP SIMD is far more powerful as it is part of OpenMP parallelism: it uses the vector unit for parallel processing, as opposed to the threading model more commonly associated with OpenMP.

Below is an example of how powerful the OpenMP SIMD directives can be.

!$omp simd private(t) reduction(+:pi)
do i=1, count
   t = ((i+0.5)/count)
   pi = pi + 4.0/(1.0+t*t)
enddo
pi = pi/count

It is just like the syntax used for threading, except that operations are run in parallel on a vector unit and not as parallel threads.
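
For comparison, a minimal C sketch of the same idea (numerically integrating 4/(1+t*t) with the loop vectorized through an OpenMP SIMD directive; the loop simply mirrors the Fortran example above):

#include <stdio.h>

int main(void)
{
    const int count = 100000000;   /* illustrative value */
    double pi = 0.0;

    #pragma omp simd reduction(+:pi)
    for (int i = 1; i <= count; i++) {
        double t = (i + 0.5) / count;   /* t declared in the loop body, hence private */
        pi += 4.0 / (1.0 + t * t);
    }
    pi /= count;

    printf("pi = %.10f\n", pi);
    return 0;
}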


A lot of teaching materials have been produced. An Intel presentation about vectorization using SIMD directives can be found here [https://software.intel.com/en-us/videos/vectorizing-fortran-using-openmp-4x-filling-the-simd-lanes]. The presentation contains links to more information from Intel.

13.2. OpenMP optimization

Threading parallelism on the Intel Xeon Phi can be readily exploited with OpenMP. All threading constructs are equivalent for both the offload and native models. We expect the basic concepts and syntax of OpenMP to be known. For OpenMP, see for instance "Using OpenMP" by Chapman et al. [8], the OpenMP forum [16], and references therein.

The high level of parallelism available on a Xeon Phi is very likely to reveal performance problems related to threading that have previously gone unnoticed in the code. In the following, we introduce a few of the most common OpenMP performance problems and suggest some ways to correct them. We begin by considering thread-to-core affinity and thread placement among the cores. For further details, we refer to [42].

13.2.1. OpenMP thread affinity

Each Xeon Phi card is a shared-memory environment with approximately 60 physical cores which, in turn, are divided into 4 logical cores each. We refer to this as the node topology.

Each memory bank resides closer to some of the cores in the topology, and therefore access to data lying in a more distant memory bank is generally more expensive. Such non-uniform memory access (NUMA) can create performance issues if threads are allowed to migrate from one logical core to another during their execution.

In order to extract maximum performance, consider binding OpenMP threads to the logical and physical cores of a single Xeon Phi card. The layout of this binding with respect to the node topology has performance implications depending on the computational task and is referred to as thread affinity.

We now briefly show how to set thread affinity using the Intel compilers and OpenMP library. For a complete description, see "Thread affinity interface" in the Intel compiler manual.

The thread affinity interface of the Intel runtime library can be controlled by using the KMP_AFFINITY environ- ment variable or by using a proprietary Intel API. We now focus on the former. A standardized method for setting affinity is available with OpenMP 4.0.

KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]

modifier   default=noverbose,respect,granularity=core.
           granularity=<{fine, thread, core}>, norespect, noverbose, nowarnings, proclist={<proc-list>}, respect, verbose, warnings.

type       default=none.
           balanced, compact, disabled, explicit, none, scatter.

permute    default=0.
           ≥ 0, integer. Not valid with type=explicit, none or disabled.

offset     default=0.
           ≥ 0, integer. Not valid with type=explicit, none or disabled.

In most cases it is sufficient to specify only the affinity type and the granularity. The most important affinity types supported by the Intel Xeon Phi are

balanced   Thread affinity balanced is a mixture of the scatter and compact affinities. Threads 1 to N, where N denotes the number of physical cores, are spread across the topology as evenly as possible in the granularity context. For thread counts greater than N, the additional threads are assigned as close as possible to threads with adjacent thread numbers.

compact    Thread n+1 is assigned as close as possible to thread n in the granularity context according to which the threads are placed.

none       Threads are not bound to any contexts. Use of affinity none is not recommended in general.

scatter    Threads 1 to N are spread across the topology as evenly as possible in the granularity context according to which the threads are placed.

The most important granularity types supported by the Intel Xeon Phi are

core          Threads are bound to a single core, but allowed to float within the context of a physical core.

fine/thread   Threads are bound to a single context, i.e., a logical core.

13.2.2. Example: Thread affinity

We now consider the effect of thread affinity on a matrix-matrix multiplication. Let A, B and C=AB be real matrices of size 4000-by-4000. We implement the data initialization and the multiplication operation in Fortran 90 (without blocking) using jki loop ordering and OpenMP as follows:

! Initialize data per thread
!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP SHARED(A,B,C) &
!$OMP PRIVATE(j)
DO j=1,n
   ! Initialize A
   CALL RANDOM_NUMBER(A(1:n,j))
   ! Initialize B
   CALL RANDOM_NUMBER(B(1:n,j))
   ! Initialize C
   C(1:n,j)=0D0
END DO
!$OMP END PARALLEL DO

! C=A*B
!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP SHARED(A,B,C) &
!$OMP PRIVATE(i,j,k)
DO j=1,n
   DO k=1,n
      DO i=1,n
         C(i,j)=C(i,j)+A(i,k)*B(k,j)
      END DO
   END DO
END DO
!$OMP END PARALLEL DO

Running and timing the matrix-matrix multiplication part on a single Intel Xeon Phi 7110 card, we have the following results.


Threads   balanced, t(s)   compact, t(s)   none, t(s)   scatter, t(s)
1         1.43E+00         1.45E+00        1.42E+00     1.58E+00
30        5.21E-02         9.37E-02        5.65E-02     6.45E-02
60        2.64E-02         5.00E-02        2.77E-02     3.26E-02
120       1.86E-02         2.68E-02        1.89E-02     2.18E-02
240       2.31E-02         2.35E-02        4.66E-02     2.54E-02

Figure 11. Results of Example: Thread affinity

As seen from the results, changing the thread affinity has implications for the performance of even a very simple test case. Using affinity none is generally not recommended and is the slowest when all available threads are being used. Using affinity balanced is generally a good compromise, especially if one wishes to use fewer than the maximum number of threads available.

13.2.3. OpenMP thread placement

The Xeon Phi OpenMP runtime includes an extension for placing the threads over the cores of a single card. For a given number of cores and threads, the user has control over where (relative to the first physical core, i.e., core 0) the OpenMP threads are placed. In conjunction with offloading from several MPI processes to a single Xeon Phi card, such control can be very useful to avoid oversubscription of cores.

The placement of the threads can be controlled with the environment variable KMP_PLACE_THREADS. It specifies the number of cores to allocate, an optional offset value and the number of threads per core to use. Effectively, KMP_PLACE_THREADS defines the node topology for KMP_AFFINITY.

KMP_PLACE_THREADS=(int ["C" | "T"] [delim] | delim) [int ["T"] [delim]] [int ["O"]], where int is a simple integer constant and delim is either "," or "x".

C default if "C" or "T" not specified

Indicates number of cores

T Indicates number of threads

O default=0O.

58 Best Practice Guide Intel Xeon Phi v2.0

Indicates number of cores to offset, starting from core 0. Offset ignores granularity.

As an example, we consider the case with OMP_NUM_THREADS=20.

KMP_PLACE_THREADS value          Thread placement
5C,4T,0O or 5,4,0 or 5C or 5     Threads are placed on cores 0 to 4 with 4 threads per core.
5C,4T,10O or 5,4,10 or 5C,10O    Threads are placed on cores 10 to 14 with 4 threads per core.
20C,1T,20O or 20,1,20            Threads are placed on cores 20 to 39 with 1 thread per core.

13.2.4. Multiple parallel regions and barriers

Whenever an OpenMP parallel region is encountered, a team of threads is formed and launched to execute the computations. Whenever the parallel region is ended, threads are joined and the computation proceeds with a single thread. Between different parallel regions it is up to the OpenMP implementation to decide whether the threads are shut down or left in an idle state.

The Intel OpenMP library leaves the threads in a running state for a predefined amount of time before setting them to sleep. The time is defined by the KMP_BLOCKTIME and KMP_LIBRARY environment variables. The default is 200 ms. For more details, see the sections "Intel Environment Variables Extensions" and "Execution modes" in the Intel compiler manual.

Repeatedly forming and disbanding thread teams and setting idle threads to sleep has some overhead associated with it. Another common source of threading overhead in OpenMP computations is implicit or explicit barriers. Recall that many OpenMP constructs have an implicit barrier attached to the end of the construct. Especially if the amount of work done inside an OpenMP construct is relatively small, thread synchronization with several threads may be a source of significant overhead. If the computations are independent, the implicit barrier at the end of OpenMP constructs can be removed with the optional NOWAIT parameter.

13.2.5. Example: Multiple parallel regions and barriers

We now consider the effect of multiple parallel regions and barriers on performance. Let v be a vector with real entries of size n=1000000. Let f(x) denote a function defined as f(x)=x+1.

We implement an OpenMP loop to apply f(x) successively repeats=10000 times to a given vector v. We consider three different implementations. In the first one, the OpenMP parallel region is re-initialized for each successive application of f(x). The second one initializes the parallel region once, but contains two implicit barriers from OpenMP constructs. In the third implementation the parallel region is initialized once and one barrier is used to synchronize the repetitions.

Implementation 1: parallel region re-initialized repeatedly.

DO rep=1,repeats
   !$OMP PARALLEL DO DEFAULT(NONE) NUM_THREADS(threads) &
   !$OMP SHARED(vec1, rep, n) &
   !$OMP PRIVATE(i)
   DO i=1,n
      vec1(i)=vec1(i)+1D0
   END DO
   !$OMP END PARALLEL DO

   ops = ops + 1
END DO


Implementation 2: parallel region initialized once, two implicit barriers from OpenMP constructs.

!$OMP PARALLEL DEFAULT(NONE) NUM_THREADS(threads) &
!$OMP SHARED(vec1, repeats, ops, n) &
!$OMP PRIVATE(i, rep)
DO rep=1,repeats
   !$OMP DO
   DO i=1,n
      vec1(i)=vec1(i)+1D0
   END DO
   !$OMP END DO

   !$OMP SINGLE
   ops = ops + 1
   !$OMP END SINGLE
END DO
!$OMP END PARALLEL

Implementation 3: parallel region initialized once, one explicit barrier from OpenMP construct.

!$OMP PARALLEL DEFAULT(NONE) NUM_THREADS(threads) &
!$OMP SHARED(vec1, repeats, ops, n) &
!$OMP PRIVATE(i, rep)
DO rep=1,repeats
   !$OMP DO
   DO i=1,n
      vec1(i)=vec1(i)+1D0
   END DO
   !$OMP END DO NOWAIT

   !$OMP SINGLE
   ops = ops + 1
   !$OMP END SINGLE NOWAIT

   ! Synchronize threads once per round
   !$OMP BARRIER
END DO
!$OMP END PARALLEL

The results on a single Intel Xeon Phi 7110 card with KMP_AFFINITY=granularity=fine,balanced and KMP_BLOCKTIME=200 are presented in the following table.

Threads   Implementation 1, t(s)   Implementation 2, t(s)   Implementation 3, t(s)
1         2.40E+01                 2.37E+01                 2.37E+01
30        7.33E-01                 7.18E-01                 6.31E-01
60        4.48E-01                 5.53E-01                 4.00E-01
120       4.34E-01                 5.04E-01                 3.83E-01
240       4.40E-01                 5.23E-01                 3.81E-01


Figure 12. Results of Example: Multiple parallel regions and barriers

As the number of threads increases, the parallel threading overhead becomes more apparent. The implementation with only one barrier is the fastest by a fair margin. Implementation 1 comes out as the second fastest. This is due to Implementation 1 having only one barrier (at the end of the parallel region) versus two in Implementation 2 (at the end of both of the OpenMP constructs). With Implementation 1, the parallel region is re-initialized immediately after it has ended and thus the waiting time for threads is less than KMP_BLOCKTIME, i.e., the threads are not put to sleep before the next parallel iteration begins.

13.2.6. False sharing

On a multiprocessor shared-memory system, each core has some local cache, which must be kept coherent among the cores in the system. The processor cache is organized into several cache lines, each of which maps to some part of the main memory. On an Intel Xeon Phi, the cache line size is 64 bytes. For reference and details, see [44].

If more than one core accesses data from the same cache line, a cache line is shared. Whenever a shared cache line is updated, to maintain coherency an update is forced to the caches of all the cores accessing the cache line.

False sharing occurs when several cores access and update different variables which reside on a single shared cache line. The resulting updates to maintain cache coherency may cause significant performance degradation. The processors need not actually share any data; it is sufficient that the data resides on the same cache line, hence the name false sharing. Due to the ring-bus architecture of the Xeon Phi, false sharing among the cores can cause severe performance degradation.

False sharing can be avoided by carefully considering write access to shared variables. If a variable is updated often, it may be worthwhile to use a private variable instead of a shared one and use a reduction at the end of the work-sharing loop.

Given a code with performance problems, false sharing may be hard to localize. The Intel VTune Performance Analyzer can be used to locate false sharing. For details, we refer to [43].

13.2.7. Example: False sharing

We now consider a simple example where false sharing occurs. Let v be a vector with real entries and size n=1E+08. Let f(x) denote a function which counts the number of entries in v which are smaller than zero.

We implement f(x) with OpenMP in two different ways. In the first implementation, each thread counts the number of negative entries it has found in v into a globally shared array. To avoid race conditions, each thread uses its own entry in the shared array, uniquely determined by the thread id. When a thread has finished its portion of the vector, a global counter is atomically incremented. The second implementation is practically equivalent to the first one, except that each thread has its own private array for counting the data.

Implementation 1: False sharing with an array counter.

!$OMP PARALLEL DEFAULT(NONE) NUM_THREADS(threads) &
!$OMP SHARED(vec, count, counter, n) &
!$OMP PRIVATE(i,TID)

TID=1
!$ TID=omp_get_thread_num()+1

!$OMP DO
DO i=1,n
   IF (vec(i)<0) counter(TID)=counter(TID)+1
END DO
!$OMP END DO

!$OMP ATOMIC
count = counter(TID)+count

!$OMP END PARALLEL

Implementation 2: Private array used to avoid false sharing.

!$OMP PARALLEL DEFAULT(NONE) NUM_THREADS(threads) &
!$OMP SHARED(vec, count, n) &
!$OMP PRIVATE(i, counter, TID)

TID=1
!$ TID=omp_get_thread_num()+1

!$OMP DO
DO i=1,n
   IF (vec(i)<0) counter(TID)=counter(TID)+1
END DO
!$OMP END DO

!$OMP ATOMIC
count = counter(TID)+count

!$OMP END PARALLEL

We note that a better implementation for this particular problem will be given in the next section.

The results on a single Intel Xeon Phi 7110 card with KMP_AFFINITY=granularity=fine,balanced are presented in the following table. We note that to obtain the results, we compiled the test code with -O1. On optimization level -O2 and higher, at least in this case, the possibility of false sharing with multiple threads was recognized and corrected by the compiler.

Threads   Implementation 1, t(s)   Implementation 2, t(s)
1         1.94E+00                 2.05E+00
30        3.26E+00                 8.30E-02
60        1.57E+00                 4.18E-02
120       4.13E-01                 2.53E-02
240       1.40E-01                 1.34E-02

Figure 13. Results of Example: False sharing

As expected, Implementation 2 is faster than Implementation 1, with a difference of an order of magnitude. Although in this case we had to lower the optimization level to prevent the compiler from correcting the situation, we cannot completely rely on the compiler to detect false sharing, especially if the code to be compiled is relatively complex.

13.2.8. Memory limitations

The available memory per core on the Xeon Phi is very limited. When an application is run using all the available threads, approximately 30 MB of memory is available per thread, assuming none of the data is shared. Excessive memory allocation per thread is therefore highly discouraged. Care should also be taken when assigning private variables in order to avoid unnecessary data duplication among threads.

13.2.9. Example: Memory limitations

We now return to the example given in the previous section. In that example, we prevented threads from false sharing by modifying the definition of the vector containing the counters. What is important to note is that in doing so, each thread now implicitly allocates a vector of length nthreads, i.e., the total memory consumption is quadratic in the number of threads. A better alternative is to let each thread store its local result in a temporary variable and use a reduction to count the number of elements smaller than zero.

Implementation 3: Temporary variable with reduction used to store local results.

count = 0
!$OMP PARALLEL DEFAULT(NONE) NUM_THREADS(threads) &
!$OMP SHARED(vec, n) &
!$OMP PRIVATE(i, Lcount, TID) &
!$OMP REDUCTION(+:count)

TID=1
!$ TID=omp_get_thread_num()+1

Lcount = 0
!$OMP DO
DO i=1,n
   IF (vec(i)<0) Lcount=Lcount+1
END DO
!$OMP END DO
count = Lcount+count
!$OMP END PARALLEL

We have the following results, where Implementation 2 refers to the second implementation given in the previous section. As previously, the results have been computed on a single Intel Xeon Phi 7110 card using KMP_AFFINITY=granularity=fine,balanced and optimization level -O1.

Threads   Implementation 2, t(s)   Implementation 3, t(s)
1         2.04E+00                 1.78E+00
30        6.81E-02                 5.93E-02
60        3.57E-02                 2.99E-02
120       2.73E-02                 2.19E-02
240       1.50E-02                 1.13E-02

Figure 14. Results of Example: Memory limitations

The main difference between Implementations 2 and 3 is memory use. Implementation 2 uses a private array for storing the result (memory usage grows quadratically with the number of threads), whereas in Implementation 3 the result is stored in a single scalar per thread. In total 57600 elements have to be stored in Implementation 2 with 240 threads, whereas just 240 elements suffice in Implementation 3 for the same number of threads. We note that if an array of results is to be computed, one should prefer an implementation where the size of the work arrays does not grow with the number of threads used.

13.2.10. Nested parallelism

Due to the limited amount of memory available, it is sometimes not possible to use all available threads on an Intel Xeon Phi to parallelize the outer loop of a computation. In some cases this may not be due to an inefficient structure of the code, but because the data needed for the computations per thread is too large. In this case, to take advantage of all the processing power of the coprocessor, an option is to use nested OpenMP parallelism.


When nested parallelism is enabled, any inner OpenMP parallel region which is enclosed within an outer parallel region will be executed with multiple threads. The performance impact of using nested parallelism is similar to the performance impact of using multiple parallel regions and barriers.

OpenMP nested parallelism is enabled by setting the environment variable OMP_NESTED=TRUE or with an API call to the omp_set_nested function. The number of threads within each nesting level is set with the environment variable OMP_NUM_THREADS=n_1,n_2,n_3,..., where n_j refers to the number of threads on the jth level. The number of threads within each nesting level can also be set with an API call to the omp_set_num_threads function.
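
A minimal C sketch of enabling nested parallelism through the runtime API (the thread counts of 60 and 4 are illustrative values only):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_nested(1);        /* same effect as OMP_NESTED=TRUE */
    omp_set_num_threads(60);  /* outer level: e.g. one thread per physical core */

    #pragma omp parallel
    {
        int outer = omp_get_thread_num();

        /* inner level: e.g. 4 hardware threads per core */
        #pragma omp parallel num_threads(4)
        {
            #pragma omp critical
            printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}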

13.2.11. Example: Nested parallelism

Consider a case where several independent matrix-matrix multiplications have to be computed. Let A be an m-by-n matrix and let the matrix Bk be defined as Bk = A^T A.

We let m=1000, n=1000 and k=240 and study the effect of parallelizing the computation of the Bk's in three different ways. The first case is to parallelize the loop over k with all available threads. The second case is to parallelize the computation over different k's across the physical Intel Xeon Phi cores and use nested parallelism with varying numbers of hardware threads in the computation of the matrix-matrix multiplications. The third case uses parallelization only over the physical cores. For this, we have the following implementation.

Implementation: possibly nested parallel loop for computing A^T A.

!$OMP PARALLEL DO DEFAULT(NONE) NUM_THREADS(nthreads) &
!$OMP SCHEDULE(STATIC) &
!$OMP SHARED(A, B, m, n, k)
DO i=1,k
   CALL DGEMM('T','N', n, n, m, 1D0, A, m, A, m, 0D0, B(1,1,i), n)
END DO
!$OMP END PARALLEL DO

The results on a single Intel Xeon Phi 7110 card with KMP_AFFINITY=granularity=fine,balanced are presented in the following table.

Threads   240/1, t(s)   60/4, t(s)   120/1, t(s)   60/2, t(s)   60/1, t(s)
time      1.06E+01      4.98E+00     5.64E+00      4.80E+00     4.81E+00

Intuitively one would expect using 240 threads to be the most efficient in this case. The results show otherwise, however. Since the threads are competing for the same cache, the performance is lowered. Nested parallelism does not offer significant improvements over just using 60 threads in a flat fashion. One reason for this might be that, with the affinity policies used, it is difficult to control the thread placement on the second level of thread parallelism.

13.2.12. OpenMP load balancing

With any parallel processing, some processes or threads may require more resources and be more time consuming than others. In a regular distributed-memory program, load balancing requires programming effort to redistribute parts of the computation among the processors. In a shared-memory program with a runtime, such as OpenMP, load balancing can in some cases be handled automatically by the runtime itself with little overhead.

OpenMP loop constructs support an additional SCHEDULE clause. The syntax for this is

SCHEDULE(kind[,chunk_size])

kind         default=STATIC.
chunk_size   default depends on the schedule kind. > 0, integer.

The different schedule kinds supported by the OpenMP runtime on a Xeon Phi are

STATIC With the static scheduling policy, iteration indices are divided into chunks of chunk_size and distributed to threads in a round-robin fashion. If chunk_size is not defined, the iteration indices are divided into chunks that are roughly equal in size.

DYNAMIC With the dynamic scheduling policy, iteration indices are assigned to threads in chunks of chunk_size. Threads request and process new chunks with assigned iteration indices until the whole index range has been processed.

GUIDED With the guided scheduling policy, iteration indices are assigned to threads in chunks of size chunk_size at minimum. At the beginning of the iteration, the chunk size actually assigned to be processed is proportional to the number of unassigned iteration indices divided by the number of available threads. The assigned chunk size can be larger than the minimum.

RUNTIME The scheduling policy and chunk size will be decided at runtime based on the OMP_SCHEDULE environment variable.
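
As an illustration (a sketch, not taken from the benchmark codes in this guide), a C loop whose iteration cost grows with the index is a typical candidate for the DYNAMIC or GUIDED policies:

#include <math.h>

/* Iterations whose cost grows with the index: a static split would leave
   the last threads with most of the work, so a dynamic schedule with a
   modest chunk size balances the load at run time. */
void process(double *work, int n)
{
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < i; k++)
            work[i] += sin((double)k);
    }
}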

14. Performance analysis tools

14.1. Intel performance analysis tools

The following Intel performance analysis tools have been enabled for the Intel Xeon Phi coprocessor:

• Intel trace analyzer and collector (ITAC)

• Intel VTune Amplifier XE

More information can be found in the Intel Parallel Studio documentation.

14.2. Scalasca

Scalasca [39] is a scalable automatic performance analysis toolset designed to profile large-scale parallel Fortran, C and C++ applications that use the MPI, OpenMP and hybrid MPI+OpenMP parallelization models. It is portable to the Intel Xeon Phi architecture in the native, symmetric and offload models. Version 2.x uses the Score-P instrumenter and measurement libraries. The following examples are taken from [45].

14.2.1. Compilation of Scalasca

On the host Scalasca can be compiled normally, while for the device one must perform a cross-compilation and add the -mmic compiler option.

14.2.2. Usage

Instrumentation of the code to be profiled is done with the command skin.

$ skin mpiifort -O -openmp *.f

This produces the instrumented executable that will be executed on the host, while

$ skin mpiifort -O -openmp -mmic *.f

produces an executable for the coprocessor. Further, the measurement is then performed with the scan command:

$ scan mpiexec -n 2 a.out.cpu    // on the host
$ scan mpiexec -n 61 a.out.mic   // on the device

It can also be launched on more than one node on the host:

$ scan mpiexec.hydra -host node0 -n 1 a.out.cpu : -host node1 -n 1 a.out.cpu

and on more than one coprocessor, if available:

$ scan mpiexec.hydra -host mic0 -n 30 a.out.mic : -host mic1 -n 31 a.out.mic

For symmetric execution one can use:

$ scan mpiexec.hydra -host node0 -n 2 a.out.cpu : -host mic0 -n 61 a.out.mic

Finally, the collected data can be analyzed with the square command. The scan output would look something like epik_a_2x16_sum for the runs performed on the host and epik_a_mic61x4_sum for those performed on the coprocessor. For data collected on the host and on the device we have:

67 Best Practice Guide Intel Xeon Phi v2.0

$ square epik_a_2x16_sum
$ square epik_a_mic61x4_sum

respectively. To analyze the data collected in a run from a symmetric execution, type:

$ square epik_a_2x16+mic61x4_sum.

15. Benchmarks

15.1. Performance evaluation of native execution

In the following we show the performance of some well-known benchmarks using the native execution model.

15.1.1. Stream memory bandwidth benchmark

Stream is a well-known benchmark for measuring memory bandwidth, written by John D. McCalpin of TACC. TACC also happens to host the large system called "Stampede", which is an accelerated system using a large array of Intel Xeon Phis. Stream can be built in several ways; it turned out that static allocation of the three vectors provided the best results. The following source code illustrates how the data is allocated:

#ifndef USE_MALLOC
static double a[N+OFFSET], b[N+OFFSET], c[N+OFFSET];
#else
static volatile double *a, *b, *c;
#endif

#ifdef USE_MALLOC
a = malloc(sizeof(double)*(N+OFFSET));
b = malloc(sizeof(double)*(N+OFFSET));
c = malloc(sizeof(double)*(N+OFFSET));
#endif

The figure below shows the difference between using malloc and static allocation. For this reason all subsequent runs using stream were done using static allocation.


Figure 15. Effect of the data allocation scheme, C malloc versus static allocation. Both runs were done with compact placements.


To achieve optimum performance on the Xeon Phi architecture one needs to take extra care when compiling programs. There are many compiler switches that can help with this. However, the switches may have a different effect on the MIC architecture than on the Xeon architecture. One possible optimization is prefetching. The MIC relies much more on software prefetching, as the MIC cores lack an equally efficient hardware prefetcher. More is left to the programmer and the compiler.


Figure 16. Effect of the compiler prefetch options (compiler options used: -opt-prefetch=4 -opt-prefetch-distance=64,32).


The figure above shows the beneficial effect of providing prefetch compiler options to the C compiler, enabling it to issue prefetch instructions in loops and other places where memory latencies impact performance. The distance numbers of iterations given to the prefetcher options are found by trial and error; there is no magic number that will be optimal in all cases. Additionally, one might use streaming store instructions to prevent store instructions from writing to the cache. There is no reason to pollute the cache with data if the data is not reused. For the tests done here it had a small effect: the bandwidth for Triad went up from 134.9 to 135.2 GiB/s, hardly significant.

Placement of threads per core is also a major performance issue. Hardware threads are grouped together onto cores and share resources. This sharing of resources can benefit or harm particular algorithms, depending on their behavior. Understanding the behavior of your algorithms will guide you in selecting the optimal affinity. Affinity can be specified in the environment (KMP_AFFINITY) or via a function call (kmp_set_affinity). There are basically three models: compact, scatter and balanced. In addition there is granularity. If granularity is set to fine, each OpenMP thread is constrained to a single hardware thread. Alternatively, setting core granularity groups the OpenMP threads assigned to a core into little quartets with free rein over their respective cores.

The affinity types compact and scatter either clump OpenMP threads together on as few cores as possible or spread them apart so that adjacent thread numbers are on different cores. Sometimes, though, it is advantageous to run a limited number of threads distributed across the cores in a manner that leaves adjacent thread numbers on the same cores. For this, there is an affinity type available, balanced, which does exactly that. Using the verbose setting, you can determine how the OpenMP thread numbers are dispersed. Some illustrations of the various placement models are shown in the following figures.


Figure 17. Illustration of the compact placement model. All 60 threads are scheduled to be as close together as possible. Four threads are sharing a single core even if there are idle cores. Beneficial when the threads can share the L2 cache.

Figure 18. Illustration of scatter placement model. All 60 threads are scheduled as far apart as possible. In this case one thread per core. Beneficial when memory bandwidth is required.

The following figure shows the effect on performance when using one, two, three or four threads per core. When run effectively, the 60 cores are capable of saturating the memory bandwidth. The stream benchmark is just copying data and does not perform any significant calculation. The effects of using more threads do not show up when all the memory bandwidth is already utilized.


Figure 19. Effect of the core affinity setting. 60 threads are used in this test. For 240 threads the effect is small.


15.1.2. HYDRO benchmark

HYDRO is a heavily used benchmark in the PRACE community; it is extracted from a real code (RAMSES, a computational fluid dynamics code). Being widely used, it has been ported to a number of platforms. The code exists in many versions: Fortran 90, C, CUDA and OpenCL, as well as serial, OpenMP and MPI versions of these. Some versions have been instrumented with performance counters to calculate the performance in Mflops/s.

The instrumented version is a Fortran 90 version, and both the OpenMP and MPI versions have been used for evaluation.

How well this architecture scales using MPI or a hybrid mode using both MPI and OpenMP is of interest in the initial testing of HYDRO. Unfortunately, the hybrid model does not provide performance numbers in Mflops/s, so run time is used to measure scaling.


Figure 20. HYDRO scaling using a pure MPI model solving the 2d problem.


The figure above shows the scaling using a pure MPI implementation of HYDRO. The scaling seems to be quite good up to the point where 2 or more threads are scheduled per core. The placements in this test are the defaults. Better tuning of the placements might provide slightly better results.

For the hybrid model the implementation of HYDRO does not yield performance numbers, so run times are taken as indicators and a scaling is calculated. As there are no runs with only one core, perfect scaling is assumed from 1 to 4 cores, i.e. a scaling factor of 4 for the 4-core run. The next figure shows the obtained scaling using 2, 3 and 4 OpenMP threads per MPI rank. Scaling is quite good up to 3 threads per rank, i.e. 180 cores, but the highest performance was measured using all 240 cores.


Figure 21. Scaling using a hybrid MPI/OpenMP model solving the 2d problem. Runs are performed using 2, 3 or 4 OpenMP threads per MPI rank. The core count is the total number of cores employed. The speedup using 4 cores is assumed to be perfect from a single core.


15.2. Offloading Performance

Gaining performance using the offload model is not a simple task. The load balancing between the host cores and the co-processor cores needs special attention.

15.2.1. Offloading with MKL

Intel provides support for automatic usage of the Xeon Phi as a co-processor by means of automatic offloading of work in MKL routines. This is a very simple way to exploit the MIC co-processor. An example is shown in the following figure:


Figure 22. MKL automatic offload using two Xeon Phi cards and both SB host processors.

[Bar chart: MKL dgemm automatic offload performance in Gflops/s versus the percentage of work offloaded to the MICs (auto, 0, 50, 80, 90, 100 %), for matrix memory footprints of 2288, 20599 and 57220 MiB.]

This figure shows the performance measured when performing dense matrix-matrix multiplication using MKL and automatic offloading. Only some environment variables need to be set to achieve this performance. It is quite remarkable how well the automatic load balancing in MKL's runtime system works. It is actually quite hard to beat the automatic load partitioning between the MIC processor and the host processor. Looking at the actual numbers there is reason to be really impressed. A normal compute node can perform at about 320 Gflops/s when doing dgemm. With two installed Xeon Phi cards with MIC coprocessors this performance clocks in at 1694 Gflops/s, or 1.7 Tflops/s per compute node.
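
A minimal C sketch of how such a run might look (matrix size and initialization are illustrative; automatic offload itself is controlled from the environment, e.g. by setting MKL_MIC_ENABLE=1, while the code contains only an ordinary MKL call):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 4096;   /* illustrative size */
    double *a = (double *)mkl_malloc(n * n * sizeof(double), 64);
    double *b = (double *)mkl_malloc(n * n * sizeof(double), 64);
    double *c = (double *)mkl_malloc(n * n * sizeof(double), 64);
    if (!a || !b || !c) return 1;

    for (MKL_INT i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* With MKL_MIC_ENABLE=1 set in the environment, MKL may transparently
       offload part of this DGEMM to the coprocessor(s). */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %f\n", c[0]);
    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}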

15.2.2. Offloading user functions

In this approach the user writes a complete function or subroutine to be offloaded. Using this routine is straightforward: it is called just like any other function, with the only addition that the data transfers between the two memories must be handled explicitly. This data transfer can overlap with other workload on the host processor, hiding the latency of the transfer.

The figure below shows the performance measured when comparing host CPUs and co-processors running Fortran 90 code with OpenMP threading of three nested loops doing matrix multiplication in double precision. This is far inferior to the MKL library, but it serves as an illustration of what can be expected from user Fortran 90 code. No special optimization has been performed; only compiler flags like -O3 and -mavx in addition to -openmp have been used.


Figure 23. Comparing host processors (Sandy Bridge, using 16 threads) with the Xeon Phi MIC co-processor. Fortran 90 code using OpenMP threading. A single MIC processor is using 240 threads. Scatter placement is used.

[Bar chart: matrix-matrix multiplication offloading performance in Gflops/s for host processors versus the co-processor, for matrix memory footprints of 2288, 5149, 5859 and 6614 MiB.]

It is relatively straightforward to set up concurrent runs, workload distribution and ultimately load balancing between the host CPUs and the Xeon Phi MIC coprocessors. However, all of the administration is left to the programmer. Since there is no shared memory, the work and memory partitioning must be handled explicitly. Only the buffers used on the co-processors need to be transferred, as memory movement is limited by the PCIe bus bandwidth. There are mechanisms for offline transfer and semaphores for synchronising both transfer and execution. All of this must be explicitly handled by the programmer. While each part is relatively simple, it can become quite complex when trying to partition the problem while also doing load balancing.

15.2.3. Accelerator Benchmark Suite - Synthetic Benchmarks

As part of PRACE-4IP, we have put together a benchmark suite suitable for evaluating accelerators. It comprises mostly application benchmarks, but also includes a synthetic benchmark part based on the Scalable HeterOgeneous Computing (SHOC) benchmark suite (available under [https://github.com/vetter/shoc]). More details about the Accelerator Benchmark Suite will be presented in a separate PRACE deliverable, D7.5 Applications Performance on Accelerators. We present below the results of running a subset of SHOC under the following settings: an OpenCL version of SHOC for the Knights Corner architecture, an offload version of the benchmarks also for Knights Corner, and an OpenCL version for a 24-core dual-socket Intel Haswell-based system. The benchmarks from SHOC are organized in three different categories. The Level 0 benchmarks are low-level benchmarks measuring the system throughput. Level 1 benchmarks are basic algorithms, while Level 2 benchmarks are application-level benchmarks. The table below gives a short description of each benchmark with its corresponding level.

Name                  Description
PCIe Download (L0)    Measures bandwidth of transferring data across the PCIe bus to a device.
PCIe Readback (L0)    Measures bandwidth of reading data back from a device.
GMEM_readBW (L0)      Measures bandwidth of read memory accesses to the global device memory.
GMEM_writeBW (L0)     Measures bandwidth of write memory accesses to the global device memory.
FFT_sp (L1)           Forward and reverse 1D FFT with single-precision floating-point accuracy.
FFT_dp (L1)           Forward and reverse 1D FFT with double-precision floating-point accuracy.
SGEMM (L1)            Single-precision matrix-matrix multiplication.
DGEMM (L1)            Double-precision matrix-matrix multiplication.
MD (L1)               Computation of the Lennard-Jones potential from molecular dynamics.
MD5Hash (L1)          Computation of many small MD5 digests, heavily dependent on bitwise operations.
Reduction (L1)        Reduction operation on an array of floating-point values.
Scan (L1)             Parallel prefix sum operation on an array with floating-point values.
Sort (L1)             Sorts an array of key-value pairs using a radix sort algorithm.
Stencil2D_sp (L1)     A 9-point stencil operation applied to a 2D data set in single-precision.
Stencil2D_dp (L1)     A 9-point stencil operation applied to a 2D data set in double-precision.
Triad (L1)            A version of the STREAM Triad benchmark that includes PCIe transfer time.
S3D (L2)              A computationally intensive kernel from the S3D turbulent combustion simulation program.

The following table shows the results of running the synthetic benchmarks from the Accelerator Benchmark Suite on both an Intel Xeon Phi 7120P-based system and on a Haswell-based system.

Name                    OpenCL KNC   Offload KNC   OpenCL Haswell
PCIe Download (GB/s)    6.8          6.6           12.4
PCIe Readback (GB/s)    6.8          6.7           12.5
GMEM_readBW (GB/s)      49.7         170           20.2
GMEM_writeBW (GB/s)     41           72            13.6
FFT_sp (GFLOPS)         71           135           80
FFT_dp (GFLOPS)         31           69.5          55
SGEMM (GFLOPS)          217          645           554
DGEMM (GFLOPS)          100          190           196
MD (GFLOPS)             33           28            114
MD5Hash (GH/s)          1.7          N/A           1.29
Reduction (GB/s)        10           99            91
Scan (GB/s)             4.5          11            15
Sort (GB/s)             0.11         N/A           0.35
Stencil2D_sp (GFLOPS)   8.95         89            34
Stencil2D_dp (GFLOPS)   7.92         16            30
Triad (GB/s)            5.57         5.76          8
S3D (GFLOPS)            18           109           27

16. Intel Xeon Phi Application Enabling

16.1. Acceleration of Blender Cycles Render engine by Intel Xeon Phi

This section was contributed by IT4Innovations, Czech Republic, to represent their experience using Intel Xeon Phi.

Blender is the name of a project, open to everybody who wants to join and has experience with programming or 3D graphics. It was founded by Ton Roosendaal, and part of this project is the development of open source software. This software can be used to create pictures, movies, animations, and even computer games. In general, it can be used in any discipline dealing with 3D graphics, from design through manufacturing of parts to medical applications such as the visualization of 3D models of human organs.

However, the main domain is the creation of movies, which is another focus area of this project.

Blender is entirely created in Python, which can be beneficial for developing new functions, features, or scripts, or for modifying the current ones. Because Python is not very suitable for computationally extensive tasks such as rendering, it can be extended using the C++ language.

The Cycles rendering engine is based on methods simulating the real behaviour of light. This is described by the so-called rendering equation presented by Kajiya in 1986. Solving the rendering equation is computationally very expensive, so effective utilization of accelerators such as the Intel Xeon Phi helps to reduce the computational time significantly.
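
For reference, the rendering equation in its commonly quoted form (standard notation, not taken from the Blender sources): the outgoing radiance L_o at a point x equals the emitted radiance plus the incoming radiance L_i weighted by the BRDF f_r and the cosine term, integrated over the hemisphere:

L_o(x,\omega_o) = L_e(x,\omega_o) + \int_{\Omega} f_r(x,\omega_i,\omega_o)\, L_i(x,\omega_i)\,(\omega_i \cdot n)\, \mathrm{d}\omega_i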

The Intel Xeon Phi coprocessor can be used in different ways. There are four possible modes:

1. Multicore only - this mode uses only the CPU.

2. Multicore Hosted with Many-Core Offload - there is one main process on the CPU host and it calls functions on the Xeon Phi coprocessor using the Offload technology (discussed later).

3. Symmetric mode – this mode uses MPI. The CPU and the Xeon Phi are on the same level; we do not need to take special care of reading data or calling functions from/on the Xeon Phi.

4. Native – we do not support it. The compiled code can run only on the Xeon Phi. For example, you can connect via ssh directly to the coprocessor and run the application in the same way as on a standard computer with a Linux operating system.

In our implementation we extended the Blender Cycles engine with support for OpenMP, MPI and Intel® Xeon Phi™ Offload, as well as the Symmetric mode. We call this extension CyclesPhi.

Symmetric mode

In symmetric mode, computations on the processor and the co-processor are on the same level, which means that we treat the co-processor in the same way as another host. Symmetric mode is very simple and there is no need for special code for the Intel Xeon Phi. To create applications for symmetric mode, several flags have to be used during compilation of the source code. This opens the communication between Blender and the client.

Offload mode

Offload mode is more complicated, and it can be activated in Blender from its menu. If we wish to execute part of the code on the MIC (which could be an OpenMP parallel construct), we need to create a block with the pragma offload syntax. There are two options to tell the compiler which part of the code should run on the host and which part on the coprocessor or on both. One can use, for example, the pragma offload_attribute(push/pop, target(mic)) command, or one can add the flag -qoffload-attribute-target=mic to the compiler.


The next set of important offload mode clauses is shown in Figure 24 and Figure 25 (more at https://software.intel.com/en-us/articles/ixptc-2013-presentations). To use the offload mode we use the following directives:

Figure 24. Important Offload clauses.

Figure 25. Important offload clauses.

First we have to specify the identification number of the resource we would like to use. For example, if we want to use the first co-processor, we use the clause target(mic:0) in the offload pragma.

To send or receive data to/from the co-processor, we have to specify a list of variables we would like to enable on the co-processor. Moreover, we have to specify for each variable what should be done with it. For example, if we want to copy data from the host to the target before the calculation, and back to the host after the calculation is done, we use the clause inout.

Other important parameters are alloc_if and free_if. These two modifiers control whether memory is allocated or de-allocated on the co-processor.
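
A generic C sketch of these clauses (the function and variable names are illustrative and not taken from the CyclesPhi sources):

/* Function compiled for both the host and the coprocessor. */
__attribute__((target(mic))) void scale(double *v, int n)
{
    for (int i = 0; i < n; i++)
        v[i] *= 2.0;
}

void run(double *v, int n)
{
    /* Copy v to the first coprocessor, keep the buffer allocated there
       after the region ends, and copy the result back to the host. */
    #pragma offload target(mic:0) inout(v : length(n) alloc_if(1) free_if(0))
    scale(v, n);

    /* Reuse the buffer already resident on the coprocessor; free it at the end. */
    #pragma offload target(mic:0) inout(v : length(n) alloc_if(0) free_if(1))
    scale(v, n);
}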

The principle of MIC computing is as follows: the rendered scene is stored in the KernelData and KernelTextures structures. At first the basic information about the scene is sent to the selected compute devices, in this case the OMPDevice, using the const_copy_to method. After that the objects and textures are saved to the memory of the MIC using the tex_alloc method. In addition, we have to allocate buffers for the rendered pixels and initialize the buffer for the random number generator state, the rng_state. An example of the implementation of the memory allocation and the data transfers to the co-processor is depicted in Figure 26.


Figure 26. Blender example code.

This example shows functions for sending data from the CPU host to the MIC. For memory allocation on the MIC, the clause with the offload target and alloc_if set to one is used. Because we use complex structures such as KernelData, we convert all data to an array of bytes before sending them to the co-processor. The next step is to decompose the rendered image, or task, into tiles (a rule of thumb on the MIC is that bigger tiles are faster). One POSIX thread calls the path_tracing method on the MIC. The results are stored in the buffer vector and sent back from the MIC to the CPU host.

Example code showing how the tasks are created is depicted in Figure 27.

Figure 27. Blender example code.

Building Blender 2.77a with Intel Compiler 2016

• https://wiki.blender.org/index.php/Dev:Doc/Building_Blender

• Intel® Parallel Studio XE Cluster Edition (free for students)

• Intel® Manycore Platform Software Stack (Intel® MPSS)

• Microsoft Visual Studio 2013 (Windows) or Netbeans 8.1, CMake 3.3, GCC 5.3 (Linux)

• Libraries (GCC/ICC): boost (./bjam install toolset=intel), ilmbase, , , openimageio, zlib, Python, …


Building CyclesPhi with Intel Compiler 2016

git clone [email protected]:blender/cyclesphi.git

New build flags:

blender:

• WITH_IT4I_MIC_OFFLOAD=ON/OFF

• WITH_IT4I_MPI=ON/OFF

• WITH_OPENMP=ON/OFF (old flag)

client:

• WITH_IT4I_MIC_NATIVE=ON/OFF

• WITH_IT4I_MIC_OFFLOAD=ON/OFF

New folders: it4i/client

• /client_api.h – main header with predefined communication tags and structures

• cycles_mic – the shared libraries for rendering on Intel Xeon Phi

• cycles_mpi – the shared libraries for communication with Blender (root)

• cycles_omp - the shared libraries for rendering on CPU using OpenMP

• main – the blender_client application

it4i/scripts (created for the Salomon supercomputer with the PBS job management system):

• build_lib.sh – build some basic libraries (boost, …)

• build_blender.sh – build Blender CyclesPhi, which supports Intel Xeon Phi

• run_blender.sh – run Blender CyclesPhi without MPI

• run_mpi.sh – run Blender CyclesPhi with MPI support

Run CyclesPhi

New scripts in it4i/scripts (created for Salomon supercomputer):

• run_blender.sh – run Blender CyclesPhi without MPI: ./${ROOT_DIR}/install/blender/Blender

• run_mpi.sh – run Blender CyclesPhi with MPI (only CPU / MIC's offload mode): mpirun -n 1 ${ROOT_DIR}/install/blender/blender : -n 1 ${ROOT_DIR}/install/blender_client/bin/blender_client

• run_mpi_mic.sh – run Blender CyclesPhi with MPI support in Symmetric mode: mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -machine $NODEFILECN -n 1 ${ROOT_DIR}/install/blender/blender : -n $NUMOFCN ${ROOT_DIR}/install/blender_client/bin/blender_client

Benchmarking

Three benchmarks were performed. The benchmarks were run on one compute node of the Salomon supercomputer equipped with two Intel Xeon E5-2680v3 CPUs (24 cores) and two Intel Xeon Phi 7120P coprocessors. GPU tests were run on two NVIDIA GeForce GTX 970 cards. For testing purposes three different scenes were chosen. The first one is a Tatra T87 car with HDRI light. The second one is a scene with a house and a water pool; the realistic water surface increases the rendering time. The last scene is taken from the Laundromat movie that we created using the Blender software. The Worm scene has a blurred background, which increases the rendering time as well. To show the performance of our MIC implementation we performed tests on different devices.

Specifically, the original version of the renderer (CPU24) was run on the two Intel Xeon CPUs (24 cores) of a Salomon compute node. The MIC2 version ran on two Intel Xeon Phi accelerators and two Intel Xeon CPUs (full utilization of a compute node). The OMP24 version ran on two Intel Xeon CPUs (24 cores). The GPU2 version of the code ran on two NVIDIA GeForce accelerators. Figure 28 shows the rendering times on the specified devices for the three different scenes. As can be seen, the fastest rendering is achieved by MIC2. This is true in all tested scenes. The speedup is almost 50% compared to the original renderer's implementation running on 24 CPU cores. There is also about a 30% speedup of MIC2 compared to the GPU2 version. The Worm scene could not be computed on the GPU device, which is the reason for the zero value in the graph. The comparison is depicted in Figure 28.

Figure 28. Performance comparison.

16.2. The BM3D Filter on Intel Xeon Phi

This section was contributed by IT4Innovations, Czech Republic, to represent their experience using Intel Xeon Phi.

16.2.1. BM3D filtering method

The block-matching and 3D collaborative filtering (BM3D) method operates on the image, trying to minimize the amount of noise based on the sparsity of similar image blocks. Generally, the noise is assumed to be Additive White Gaussian Noise (AWGN) that has been added to the original noiseless image.
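
In other words, the observed image z is modelled (in standard notation, not the symbols of [52]) as the noiseless image y plus independent Gaussian noise of variance sigma squared:

z(x) = y(x) + \eta(x), \qquad \eta(x) \sim \mathcal{N}(0,\sigma^2)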

The detailed description of the BM3D method is covered in [52]. The color version of the method is presented in [53].

BM3D is a two-step method. In the first step, the image is divided into several overlapping areas in which similar smaller parts of the image, called patches, are searched for. Searching for the similar patches is done in a sparse domain provided by a wavelet transform of each patch. The found similar patches are stacked in a 3D array and the whole stack is transformed from the image to the sparse representation by a 3D wavelet transform. Here the filtering operation in the form of hard thresholding is performed, and then the stack is transformed back to the image domain. Each patch from the filtered stack is redistributed back to its position within the image and all overlapping pixels are aggregated and averaged out by a weighted average. The summation goes through all the patches within one stack and through all the stacks within one image.

The filtering steps of the method are graphically summarized in Figure 29.


Figure 29. Workflow of the collaborative filtering method

16.2.2. BM3D - computationally extensive parts

The BM3D method operates in the sparse domain, where all the filtering is performed either by hard thresholding in Step 1 or by Wiener filtering in Step 2. Conversion to the sparse domain is done by matrix multiplication of each selected image patch with transformation matrices. The number of patches that are transformed and then further processed is high. Based on the recommended settings of the method, as elaborated in [54] and [55], it is around 1300 patches for every single reference patch. The number of reference patches depends on the image resolution, but generally it is about 1/16th of the number of image pixels. In the case of rendering tasks, the image resolution is typically HD (1920x1080 pixels) or higher, which gives approximately 130 000 reference patches at least. Summing it all up, this gives around 340 000 000 image patches that are processed just in Step 1. A similar count also holds for Step 2. Fortunately, the computations can be parallelized around each reference patch. Concurrency between Step 1 and Step 2 is not possible, since the results of Step 1 are used right at the start of Step 2.

16.2.3. Parallel implementation of BM3D filter

To achieve the best possible implementation in terms of the algorithm speed, we have implemented it in C++. We have used OpenMP standard with its #pragma directives for parallel programming and SIMD directives for vectorization.

We use our own vectorized code for matrix operations (summation, subtraction, multiplication, norm) because, as our tests showed, it performs better than the Eigen or MKL libraries on small matrices. We first tried implicit vectorization of the matrix operations, and as a further step we used SIMD vectorization, which brings another speed-up. Profiling the code still showed a large bottleneck in the matrix multiplications. This was solved by a direct assembler implementation, which helps especially on the MIC architecture, as can be seen in the results. The assembler code was generated by the library for small matrix multiplications (XSMM). Examples of the source code are shown in the figures below.

Figure 30. The source code of multiplication generated by XSMM for KNC

Figure 31. The macro used below. The alignment value is chosen for KNC (128 is faster than 64) or for AVX2.


Figure 32. The matrix multiplication with SIMD

Figure 33. The matrix multiplication with MKL libraries

Figure 34. The matrix multiplication with Eigen libraries

Figure 35. The adding of two matrices with SIMD vectorization


Figure 36. The norm of matrix with SIMD vectorization

16.2.4. Results

The Tatra scene was used as an example for the tests of the filter processing time. We tested the filter for a set of sampling rates. The specific sampling does not influence the filter runtime directly; what does is the value of sigma. Measurements on different architectures (CPU, MIC) with specific optimizations (AVX2, AVX2+SIMD, AVX2+xsmm+SIMD, knc-offload, knc-offload+SIMD, knc-offload+xsmm+SIMD) are also shown.

Our speed improvements from the different optimizations of the filtering algorithm are represented graphically in Figure 40.

All the tests were performed on one compute node of the Salomon supercomputer, specifically on 2x Intel Xeon E5-2680v3, 2.5 GHz for the CPU tests and 1x Intel Xeon Phi 7120P, 61 cores, for the MIC tests.

Figure 37. Rendered scene of the Tatra car without filtering; from left to right: 64 samples/pixel, 1 sample/pixel

Figure 38. Rendered scene of the Tatra car with filtering; from left to right: 64 samples/pixel, 1 sample/pixel


Figure 39. Tatra - BM3D runtime for different optimizations and architectures. Comparison of our implementation with the original version of the filter by Dabov et al. for two different values of noise (sigma)

Figure 40. Tatra - BM3D runtime for different optimizations and architectures. Comparison of our implementation with the original version of the filter by Dabov et al. for two different values of noise (sigma)

16.3. GEANT4 and GEANTV on Intel Xeon Phi Systems

This section was contributed by NCSA, Bulgaria, to represent their experience using Intel Xeon Phi.

16.3.1. Introduction

GEANT4 [http://geant4.cern.ch/] and GEANTV [http://geant.cern.ch/] (GEometry ANd Tracking) are Monte Carlo packages for simulation of the passage of particles through matter. This is an important task for several scientific areas, ranging from fundamental science (high-energy physics experiments, astrophysics and astroparticle physics, nuclear investigations) to applied science (radiation protection, space science), biology and medicine (DNA investigations, hadron therapy, medical physics and medical imaging), and education.

GEANT4 and GEANTV are complex software libraries meant to be linked to user-specific code. These libraries require a number of additional libraries and packages to be configured and installed in a specific way in advance. Here we discuss the installation of the GEANT4 and GEANTV libraries on hybrid architectures with Intel Xeon Phi coprocessors [http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html]. In addition, we give some guidelines for the configuration, compilation and running of the user-specific codes to be linked to these libraries. For file transfer, one can use Gsatellite [https://github.com/fr4nk5ch31n3r/gsatellite/wiki/Gsatellite-explained] (see “USE CASE – GSATELLITE” for details and a discussion of possible advantages/disadvantages of this set of tools).

Avitohol [http://www.hpc.acad.bg/system-1/] was used as the test system, with a LINPACK [https://software.intel.com/en-us/articles/intel-linpack-benchmark-download-license-agreement] performance of 264.2 TFlop/s and the operating system Red Hat Enterprise Linux release 6.7 [https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.7_Release_Notes/]. The coprocessors run MPSS version 3.6.

16.3.2. GEANT4 pre-requisites

1. GEANT4 version 10.0 patch-04

2. Intel C/C++ compiler version 16.X or higher. We have used ICC and ICPC version 16.0.2 (compatible with gcc version 4.4.7)

3. Intel MPSS (Manycore Platform Support Stack) version 3.4

4. CMake [https://cmake.org/] version 3.3 or higher

5. C++ Compiler and Standard Library supporting the C++11 Standard

6. Linux: GNU Compiler Collection [https://gcc.gnu.org/] 4.8.2 or higher

16.3.3. GEANT4 configuration and installation

Setting up the directories and downloading the code:

export GEANT_DIR=/opt/soft/geant4/mic/non-mic
cd $GEANT_DIR
wget https://geant4.web.cern.ch/geant4/support/source/geant4.10.00.p04.tar.gz
tar -xzvf geant4.10.00.p04.tar.gz
rm geant4.10.00.p04.tar.gz
mkdir geant4.10.00.p04-build geant4.10.00.p04-install geant4.10.00.p04-data
cd geant4.10.00.p04-build/

It is worth mentioning that the only GEANT4 version which can be compiled with the Intel version 16.X compiler is 4.10.00; compilation of older or newer versions results in various compilation errors.

GEANT4 needs several databases with physical/interaction constants. The databases can be downloaded from http://geant4.cern.ch/support/source/ and placed into the geant4.10.00.p04-data folder.

Setting-up cmake, compilers and linkers:

The compiler and linker have to be set with appropriate environment variables. In what follows, we use bash [https://www.gnu.org/software/bash/manual/] shell, but the compilation procedure can be carried out under any other shell (tcsh for instance).

The shell commands listed below set the compilers (in our case ICC and ICPC version 16.0.2, compatible with gcc version 4.4.7) and MPSS (in our case version 3.6-1), as well as the linker (GNU LD) and archiver (GNU AR, needed to create libraries). Both the linker and the archiver are part of GNU Binutils [https://www.gnu.org/software/binutils/] version 2.22.52.20120302.

/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/iccvars.sh intel64 \
    -arch intel64 -platform linux
export CC=/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icc
export CXX=/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icpc
export LD=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-ld
export AR=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-ar
export PATH=/opt/soft/geant4/mic/cmake-3.3.2/bin/:${PATH}
export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64_lin

These environment variables are needed also for compiling the applications. For convenience, we have gathered them in geant4_compilers_setup_nonmic.sh, to be sourced before user code compilation.

The file mic-toolchain-file.cmake has to be created, containing the following configuration:

# this one is important
SET(CMAKE_SYSTEM_NAME Linux)
# this one not so much
SET(CMAKE_SYSTEM_VERSION 1)
# specify the cross compiler
SET(CMAKE_C_COMPILER /opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icc)
SET(CMAKE_CXX_COMPILER /opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icpc)
# where is the target environment
SET(CMAKE_FIND_ROOT_PATH /opt/intel/compilers_and_libraries_2016.2.181/linux)
# search for programs in the build host directories
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
# for libraries and headers in the target directories
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

The information in the .cmake file seems to repeat the information set in the environment variables, but this redundancy is necessary to instruct the cmake utility about the compiler and binutils, and to prevent cmake from using the standard compiler installed on the machine while neglecting the environment variables.

cp mic-toolchain-file.cmake $GEANT_DIR
cd $GEANT_DIR/geant4.10.00.p04-build/
cmake -DGEANT4_BUILD_MULTITHREADED=ON -DGEANT4_USE_SYSTEM_EXPAT=OFF \
 -DGEANT4_USE_FREETYPE=ON -DGEANT4_INSTALL_DATA=OFF \
 -DGEANT4_INSTALL_DATADIR=../geant4.10.00.p04-data/ -DCMAKE_C_COMPILER=${CC} \
 -DCMAKE_CXX_COMPILER=${CXX} -DCMAKE_LINKER=${LD} -DCMAKE_AR=${AR} \
 -DCMAKE_TOOLCHAIN_FILE=../mic-toolchain-file.cmake -DBUILD_SHARED_LIBS=OFF \
 -DBUILD_STATIC_LIBS=ON -DCMAKE_INSTALL_PREFIX=../geant4.10.00.p04-install \
 ../geant4.10.00.p04

GEANT4 compilation takes some time, so it is convenient to use all the cores for it. The location of the databases should also be set:

basepath=$GEANT_DIR/geant4.10.00.p04-data
export G4LEVELGAMMADATA=${basepath}/PhotonEvaporation3.0
export G4NEUTRONXSDATA=${basepath}/G4NEUTRONXS1.4
export G4LEDATA=${basepath}/G4EMLOW6.35
export G4NEUTRONHPDATA=${basepath}/G4NDL4.4
export G4RADIOACTIVEDATA=${basepath}/RadioactiveDecay4.0
export G4ABLADATA=${basepath}/G4ABLA3.0
export G4PIIDATA=${basepath}/G4PII1.3
export G4SAIDXSDATA=${basepath}/G4SAIDDATA1.1
export G4REALSURFACEDATA=${basepath}/RealSurface1.0

The user code has to be compiled and linked with GEANT4:

#setting up compiler and binutils
source geant4_compilers_setup_nonmic.sh
cd $WORK_DIR
cmake -DCMAKE_LINKER=${LD} -DCMAKE_AR=${AR} \
 -DCMAKE_TOOLCHAIN_FILE=/opt/soft/geant4/mic/non-mic/geant4.10.00.p04/mic-toolchain-file.cmake \
 -DGeant4_DIR=$GEANT_DIR/geant4.10.00.p04-install $APPLICATION_SOURCE
make -j $NCORES

with the WORK_DIR and APPLICATION_SOURCE variables pointing to the folder where the executable will be created and to the user source code, respectively.

The compiled code can be used on the CPU or on the Xeon Phi in offload mode. It is not suitable for execution in native mode. Cross compilation for native execution is described in the next section.

16.3.4. GEANT4 cross compilation as a native Xeon Phi application

The native mode cross compilation for Xeon Phi does not differ much from the usual compilation; only the differences from the CPU compilation will be highlighted.

To instruct the compiler to generate Xeon Phi native code, some additional environment variables should be set:

export LDFLAGS=-mmic
export CXXFLAGS=-mmic
export CFLAGS=-mmic

The difference is that the variables LDFLAGS, CXXFLAGS and CFLAGS are set to the option -mmic, which instructs the compilers and the linker to cross-compile to native Xeon Phi code. Again, we recommend saving this setup in a file (geant4_compilers_setup.sh) to be sourced prior to code compilation.

16.3.5. GEANTV installation

GEANTV needs many specific packages to be pre-installed, most of them compiled and installed from scratch. The installation order is determined by the package dependencies, and the compilation options have to be specified explicitly. The default system compiler on Red Hat Enterprise Linux release 6.7 is not sufficient to compile GEANTV, so even the GNU compiler has to be built from scratch, together with some libraries needed by the compiler. Any version above 4.8.0 should be sufficient; we have verified that version 4.8.5 works.

Before GEANTV compilation, the following packages should be properly configured and installed:

• ROOT6 [https://root.cern.ch/]

• Vc [https://github.com/VcDevel/Vc]

• VecGeom [https://gitlab.cern.ch/VecGeom/VecGeom]

• Pythia8 [http://home.thep.lu.se/~torbjorn/Pythia.html]

• HepMC3 [https://hepmc.web.cern.ch/hepmc/]

• Xerces-c [http://xerces.apache.org/xerces-c/]


• GEANT4 [https://geant4.web.cern.ch/geant4/]

16.3.6. GEANT4 user code compilation and execution on CPU

User code compilation: Supposing the user code is in the directory pointed to by the environment variable $USR_CODE and the directory to build and store the executable is pointed to by $USR_BUILD, the compilation instructions are:

#setting up cmake, compiler and compiler flags
/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/iccvars.sh intel64 \
    -arch intel64 -platform linux
export CC=/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icc
export CXX=/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icpc
export LD=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-ld
export AR=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-ar
export PATH=/opt/soft/geant4/mic/cmake-3.3.2/bin/:${PATH}
#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64_lin
export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64_lin

cd $USR_BUILD
cmake -DCMAKE_LINKER=${LD} -DCMAKE_AR=${AR} \
 -DCMAKE_TOOLCHAIN_FILE=/opt/soft/geant4/mic/non-mic/geant4.10.00.p04/mic-toolchain-file.cmake \
 -DGeant4_DIR=/opt/soft/geant4/mic/non-mic/geant4.10.00.p04/geant4.10.00.p04-build $USR_CODE

Running GEANT4 user code on CPU: To run the code one needs to set the necessary variables first and then to execute the program. Here is a generic example:

#set GEANT_DIR environment variable
export GEANT_DIR=/opt/soft/geant4/geant4.10.00.p04

#set variables to point to the databases
basepath=$GEANT_DIR/geant4.10.00.p04-data
export G4LEVELGAMMADATA=${basepath}/PhotonEvaporation3.0
export G4NEUTRONXSDATA=${basepath}/G4NEUTRONXS1.4
export G4LEDATA=${basepath}/G4EMLOW6.35
export G4NEUTRONHPDATA=${basepath}/G4NDL4.4
export G4RADIOACTIVEDATA=${basepath}/RadioactiveDecay4.0
export G4ABLADATA=${basepath}/G4ABLA3.0
export G4PIIDATA=${basepath}/G4PII1.3
export G4SAIDXSDATA=${basepath}/G4SAIDDATA1.1
export G4REALSURFACEDATA=${basepath}/RealSurface1.0
export G4ENSDFSTATE=${basepath}/G4ENSDFSTATE1.0

# entering the directory with the user code executable
cd ~/user_executable/B2a

#set the number of threads
export G4FORCENUMBEROFTHREADS=32

# launch the user application (exampleB2a) with the configuration file
# (configuration.in)
./exampleB2a configuration.in

16.3.7. GEANT4 user code compilation and execution on Xeon Phi in native mode

User code compilation: The user code is in the directory pointed to by the environment variable $USR_CODE and the directory to build and store the executable is pointed to by $USR_BUILD:

#setting up cmake, compiler and compiler flags
/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/iccvars.sh intel64 \
    -arch intel64 -platform linux
export CC=/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icc
export CXX=/opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icpc
export LD=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-ld
export AR=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-ar
export LDFLAGS=-mmic
export CXXFLAGS=-mmic
export CFLAGS=-mmic
export PATH=/opt/soft/geant4/mic/cmake-3.3.2/bin/:${PATH}

cd $USR_BUILD
cmake -DCMAKE_LINKER=${LD} -DCMAKE_AR=${AR} \
 -DCMAKE_TOOLCHAIN_FILE=/opt/soft/geant4/mic/geant4.10.00.p04/mic-toolchain-file.cmake \
 -DGeant4_DIR=/opt/soft/geant4/mic/geant4.10.00.p04/geant4.10.00.p04-build \
 $USR_CODE

The difference from the CPU case is that the flags LDFLAGS, CXXFLAGS and CFLAGS are set to the -mmic option.

Running in native mode on Xeon Phi: One way to run a native application is to connect to the co-processor by ssh, to set the necessary environment and to execute the code directly. Here is an example:

#ssh to mic0 on sl051 host
ssh sl051-mic0

#set GEANT_DIR environment variable
export GEANT_DIR=/opt/soft/geant4/mic/geant4.10.00.p04

#set variables to point to the databases
basepath=$GEANT_DIR/geant4.10.00.p04-data
export G4LEVELGAMMADATA=${basepath}/PhotonEvaporation3.0
export G4NEUTRONXSDATA=${basepath}/G4NEUTRONXS1.4
export G4LEDATA=${basepath}/G4EMLOW6.35
export G4NEUTRONHPDATA=${basepath}/G4NDL4.4
export G4RADIOACTIVEDATA=${basepath}/RadioactiveDecay4.0
export G4ABLADATA=${basepath}/G4ABLA3.0
export G4PIIDATA=${basepath}/G4PII1.3
export G4SAIDXSDATA=${basepath}/G4SAIDDATA1.1
export G4REALSURFACEDATA=${basepath}/RealSurface1.0
export G4ENSDFSTATE=${basepath}/G4ENSDFSTATE1.0
export LD_LIBRARY_PATH=/opt/intel/lib/mic:$LD_LIBRARY_PATH

# entering the directory with the user code executable
cd ~/mic_user_executable/B2a

#set the number of threads
export G4FORCENUMBEROFTHREADS=32

# launch the user application (exampleB2a) with the configuration file


# (configuration.in)
./exampleB2a configuration.in

The figure below shows how the execution time of a particular GEANT4 example scales with the number of threads on the Xeon Phi in native mode.

Figure 41. Program execution time vs the number of threads (128k events).

17. European Intel Xeon Phi based systems

17.1. Avitohol @ BAS (NCSA)

17.1.1. Introduction

In June 2015, the new supercomputer Avitohol was deployed at IICT-BAS. The cluster consists of 150 HP Cluster Platform SL250S GEN8 servers.

Each compute node has 2 Intel Xeon CPU E5-2650 v2 @ 2.6 GHz (8 cores each), 64GB RAM and 2 Intel Xeon Phi 7120P coprocessors (with 61 cores each). Thus, in total there are 20700 cores.

Additionally, there are 4 I/O nodes to serve the 96 TB of disk storage.

The Avitohol cluster currently has the following capabilities: theoretical peak performance of 412.32 TFlop/s, total memory 9600 GB and total disk storage of 96 TB.

It was ranked at 389th place on the November 2015 Top 500 list (http://www.top500.org) with a LINPACK performance of 264.2 TFlop/s.

Figure 42. The Avitohol HPC system.

17.1.2. System Architecture / Configuration

The current system configuration includes:

• 150 dual-socket nodes HP ProLiant SL250S Gen8.

• 1 of them reserved for login/job submission and application development.

• 4 I/O nodes HP ProLiant DL380p Gen8 with 2 Intel Xeon CPU E5-2650 v2, 64GB RAM, Fibre Channel cards.

• 2 management nodes HP ProLiant DL380p Gen8 with 2 Intel Xeon CPU E5-2650 v2, 64GB RAM.

• 96 TB of online disk space, connected with Fibre Channel with the I/O nodes.

• Fully non-blocking 56Gbps FDR InfiniBand network interconnecting all the nodes above.

The configuration of the computing nodes is as follows:

• Processors: dual Intel Xeon 8-core CPU E5-2650 v2 @ 2.6 GHz,


• Coprocessors: two Intel Xeon Phi 7120P coprocessors, each with 16 GB RAM and 61 cores

• Main memory of the compute nodes: 64 GB (9.6 TB in total)

• Memory of the accelerators: 16 GB (4.8 TB in total)

The operating system on Avitohol is Red Hat Enterprise Linux release 6.7.

The coprocessors run MPSS version 3.6-1.

17.1.2.1. Filesystems

Currently the only filesystem that is read-write for all users is the /home filesystem, which is of type Lustre. Groups of users may request to have a shared directory to share data between them.

Some software that is made available to all users is available under /opt and is typically deployed using environment modules. This filesystem is shared via NFS and is read-only for users.

Currently the Lustre filesystem is provided by only two OSTs. This effectively means that one big file may be served from two storage servers, if the file is created in an appropriate way.

To make files in a directory "stripe" in this way one can run

lfs setstripe --count 2 <directory>

which enforces striping across 2 OSTs in parallel for the files created there.

Setting the count to two may be advisable if high I/O rates are expected for files in a given directory.
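As an illustration, the following commands (a minimal sketch; the directory name is only an example) enable striping over both OSTs for a directory and verify the resulting layout:

lfs setstripe --count 2 /home/$USER/big_output    # new files in this directory are striped over 2 OSTs
lfs getstripe /home/$USER/big_output              # report the striping layout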

The Lustre filesystem is optimized for processing of large files. This means that creating a large number of small files may be a relatively slow process. However, if fast processing of a high number of small files is necessary, one can use the /dev/shm filesystem on each node, which resides in RAM. This filesystem is not shared between nodes and does not survive system restarts! For most users its use is not needed.

Further reading

For more information about the Lustre filesystem, consult the Lustre documentation. See, e.g., https://wiki.hpdd.intel.com/display/PUB/HPDD+Wiki+Front+Page

Further details

Applications with high I/O requirements can request dedicated storage.

17.1.3. System Access

17.1.3.1. Remote access

The gateway node for accessing Avitohol is the node gw.avitohol.acad.bg. The key fingerprints are currently:

1024 cd:97:16:41:2c:cd:3d:59:c4:f0:d2:67:ce:1c:5e:62 /etc/ssh/ssh_host_dsa_key.pub (DSA)

2048 b9:e3:6f:f5:a2:ea:8f:95:0a:96:5b:55:03:0c:bf:e9 /etc/ssh/ssh_host_rsa_key.pub (RSA)

For security reasons it is advisable for users to use keys with at least 2048 bits.

Access is provided via the ssh protocol.

The gateway node has identical hardware and software configuration (as much as possible) as the execution nodes.

Only this node can be used for launching jobs.

Once inside the cluster, the user can login with ssh to any of the execution nodes where he or she has a running job. It is also possible to login to the Xeon Phi accelerators that are associated with such nodes.


17.1.3.2. Using compute resources on Avitohol

The login node gw.avitohol.acad.bg is mainly intended for editing, compiling and submitting compute-intensive programs. Some lightweight testing of parallel programs is allowed. In case of doubt, users are encouraged to submit a job requesting one whole compute node and use it for debugging and development.

The execution nodes are named sl001-sl150.

17.1.3.3. Interactive debug runs

If you need to debug your program code you may login to the node gw.avitohol.acad.bg and run your code interactively. This node has two CPUs and two Xeon Phi coprocessors similar to the execution nodes.

Please monitor resource utilization, and in case the system becomes overloaded (as can be seen from running top), cancel the debug runs and perform them via job submission on an execution node, where you will get dedicated access. Users whose codes crash the node, or crash or hang the Xeon Phi coprocessors, should try to resolve the issues using execution nodes allocated by the batch system. If the problems persist, they are encouraged to contact the system administrators.

The support email is: [email protected].

Further details

There are other ways of accessing storage from the system, which do not concern most users. If your storage requirements exceed one terabyte, please describe them fully in the access form. Access to different interfaces can be provided based on evaluation of these needs.

The Xeon Phi coprocessors are visible as sl001-mic0 to sl150-mic0 and sl001-mic1 to sl150-mic1. Users are expected to use them only after obtaining exclusive access to the corresponding execution node.

17.1.4. Production Environment

17.1.4.1. Batch System

The login node gw.avitohol.acad.bg of the Avitohol HPC system is intended mainly for editing and compiling of parallel programs. Interactive usage of mpirun/mpiexec is allowed on the login node and its coprocessors, although it should not be necessary. It is advisable to switch to using a dedicated node obtained through the batch system if such parallel runs take more than a few minutes or use lots of memory.

To run test or production jobs, submit them to the torque batch system, which will find and allocate the resources required for your job (e.g. the compute nodes to run your job on).

Short test jobs (shorter than 15 min) with 2, 4 or 8 cores will run with short turn-around times. However, the use of partial nodes (i.e., less than 16 cores from one node) is discouraged. The use of multiple partial nodes, e.g., by requesting 10 nodes with 4 cores, is strongly discouraged.

By default, the job run limit is set to 24 hours on Avitohol. If your batch jobs can't run independently from each other, please use job steps or contact the system administrators helpdesk using the email address [email protected].

The Intel processors support the "Hyperthreading / Simultaneous Multithreading (SMT)" mode, which might increase the performance of your application by up to 20%. However, we leave it to the users to decide whether to use hyperthreading or not. Currently the execution nodes are declared to have 16 CPU cores in torque, although the Linux OS sees 32 (logical) cores.

Users are encouraged to request multiples of full nodes. For example, by adding

#PBS -l nodes=100:ppn=16

in the PBS script you can ask for 100 dedicated nodes.

With hyperthreading, you have to increase the number of actual MPI processes or OpenMP threads to 32 instead of 16 for each node.

However, it is best to first measure the performance on a realistic benchmark problem to see if hyperthreading is actually useful. In both cases, with and without hyperthreading, the request for nodes to PBS will be the same, but the mpirun command line would change.

For example, one can compare the performance of the same executable testprogram with and without hyperthreading on 100 nodes by measuring the execution times of:

mpirun -f hostfile -np 1600 -ppn 16 ./testprogram

vs.

mpirun -f hostfile -np 3200 -ppn 32 ./testprogram

The file named hostfile should contain the list of all nodes in the job. This can be achieved by running:

cat $PBS_NODEFILE | sort | uniq > hostfile
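Putting the pieces together, a job script along the following lines could be used; this is a minimal sketch under stated assumptions (100 nodes, no hyperthreading, and a hypothetical executable named testprogram):

#!/bin/bash
#PBS -l nodes=100:ppn=16
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
# build the host file with one entry per allocated node
cat $PBS_NODEFILE | sort | uniq > hostfile
# 16 MPI ranks per node, i.e. without hyperthreading
mpirun -f hostfile -np 1600 -ppn 16 ./testprogram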

Please be aware that with 32 tasks per node each process gets only half of the memory by default. In the Avitohol cluster there are 150 compute nodes, each with 64 GB of real memory (some of which is taken by the OS). So, on these nodes you can expect to have 2 GB per core for MPI jobs with hyperthreading or 4 GB per core for MPI jobs without hyperthreading.

The default Parallel Environment is Intel MPI. You can use executables that were built with other MPI libraries. For example, openmpi is also available system-wide.

For each execution node there are two associated Xeon Phi coprocessors. For example, on node sl039 the co- processors are available via ssh or for mpiexec with the names sl039-mic0 and sl039-mic1.

Each coprocessor is running its own operating system, which means that it appears as separate machine.

For each execution node there are alternative names that correspond to alternative network interfaces, associated with the InfiniBand cards. For example, ibsl039 corresponds to the IP-over-InfiniBand network interface ib0 on host sl039.

The same type of IP-over-InfiniBand interfaces are available on the coprocessors, for example ibsl039-mic0 and ibsl039-mic1.

Although these interfaces offer higher bandwidth than the standard Ethernet interfaces, they are not necessary for normal use, because MPI jobs are expected by default to use native InfiniBand and not IP-over-InfiniBand.

The Lustre filesystem already uses native InfiniBand transport, and since files are shared between nodes and even available on the coprocessors, the need to move data between nodes via the IP-over-InfiniBand interface may be an indication of sub-optimal use of the system.

InfiniBand is also the default transport for MPI when using the coprocessors via MPI.

Once a given user has obtained exclusive access to an execution node he or she can also use the corresponding Xeon Phi coprocessors. Inside a job script one may read the contents of the file pointed to by the environment variable PBS_NODEFILE and obtain the names of the corresponding Xeon Phi nodes, as sketched below. Access to the Xeon Phi coprocessors is not managed separately.
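A minimal sketch of that idea, assuming the naming convention described above (host names such as sl039 with coprocessors sl039-mic0 and sl039-mic1; the file name micfile is only an example):

# derive the coprocessor host names from the nodes allocated to the job
for host in $(sort -u $PBS_NODEFILE); do
    echo ${host}-mic0
    echo ${host}-mic1
done > micfile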

However, users are not allowed to login with ssh to nodes where they do not have running jobs. To determine which nodes have been allocated to a given job, the user can run

qstat -na

Users are not expected to use Xeon Phi coprocessors without first requesting their host node from the batch system.


Further reading

For more information about using the torque batch system and moab scheduler please see the documentation from the website of the software provider: www.adaptivecomputing.com.

17.1.5. Programming Environment

17.1.5.1. Compilers

The Intel Fortran (ifort) and C/C++ (icc, icpc) compilers are the default compilers on the HPC cluster Avitohol and are provided automatically at login (see the output of module list for details on the version).

To compile and link MPI codes using Intel compilers, use the commands mpiifort, mpiicc or mpiicpc, respectively.
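For example, a hybrid MPI/OpenMP code might be built with these wrappers both for the host and, via cross-compilation, for the coprocessor; the source and binary names below are only illustrative:

mpiicc -qopenmp -O2 mycode.c -o mycode            # build for the Xeon host
mpiicc -qopenmp -O2 -mmic mycode.c -o mycode.mic  # cross-compile for native execution on the Xeon Phi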

The GNU compiler collection (gcc, g++, gfortran) is available as well. A default version comes with the operating system; version 4.4.7 is installed on the Avitohol system. More recent versions can be accessed via environment modules. To compile and link MPI codes using GNU compilers, use the commands mpicc, mpic++/mpicxx, mpif77 or mpif90, respectively.

The compilation for the Xeon Phi is usually done at the host, i.e., using cross-compilation.

To load the development environment for the Xeon Phi one can issue

source /opt/mpss/3.6/environment-setup-k1om-mpss-linux

Then the version of the compiler and other development tools can be accessed with commands such as:

k1om-mpss-linux-gcc
k1om-mpss-linux-g++
k1om-mpss-linux-nm
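For instance, a small test program could be cross-compiled with this toolchain along the following lines (a sketch; hello.c and the output name are assumptions):

source /opt/mpss/3.6/environment-setup-k1om-mpss-linux
k1om-mpss-linux-gcc -O2 hello.c -o hello.mic      # the resulting binary runs only on the coprocessor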

Invoke the command module avail to get an overview on all compilers and versions available on Avitohol.

Modules environment

For more information on general Modules usage please refer to the PRACE Generic x86 Best Practice Guide http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-Generic-x86.pdf.

For more information about using MPSS consult the documentation from Intel: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide

17.1.5.2. Parallel Programming

On the Avitohol HPC cluster two basic parallelization methods are available.

MPI: The Message Passing Interface provides a maximum of portability for distributed memory parallelization over a large number of nodes.

OpenMP: OpenMP is a standardized set of compiler directives for shared memory parallelism. On Avitohol a pure OpenMP code is restricted to run on a single node (whether the host or the coprocessor system).

It is possible to mix MPI and OpenMP parallelism within the same code in order to achieve large scale parallelism.

MPI and OpenMP are also available on the coprocessors, and all the modes of using coprocessors - native, sym- metric and offload, are available.

Each Xeon Phi coprocessor can run OpenMP programs in native mode. The number of physical cores on a coprocessor is 61. It is advisable to reserve one core for operating system use and to use at most 60 cores. However, the number of logical cores on the coprocessor is 4 times the number of physical cores, thus 244. Experience has shown that using more threads than the number of physical cores can be beneficial. Users can try to find which setting of the number of OpenMP threads, via the variable OMP_NUM_THREADS (in native mode) or MIC_OMP_NUM_THREADS (in offload mode), is acceptable. In many cases 120 seems to be a good compromise.
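For example, a first experiment might start from the compromise value mentioned above (120 is only the suggested starting point, not a tuned setting):

export OMP_NUM_THREADS=120        # native runs on the coprocessor
export MIC_OMP_NUM_THREADS=120    # offloaded regions launched from the host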

For programs that run on the host, the number of threads may be set equal to the number of CPU cores (16) or the number of logical cores (32) if only OpenMP is used. If a hybrid MPI/OpenMP program is run, one must take into account how many MPI processes run on one node. For example, if two MPI processes are run on each node, then OMP_NUM_THREADS should be set to 8 to use all physical cores or 16 to use all logical cores.

17.1.5.3. Parallel MPI applications

By default, the Intel compilers for Fortran/C/C++ are available on Avitohol.

The MPI wrapper executables are mpiifort, mpiicc and mpiicpc. These wrappers pass include and link information for MPI together with compiler-specific command line flags to the Intel compiler.

The parallel program can be started with mpiexec. We remind the reader that environment variables can be passed via options. For example, to set the variable MKL_MIC_ENABLE to 1, thus enabling automatic offload for all MPI processes, one can add

-genv MKL_MIC_ENABLE 1

to the mpiexec command line.
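A complete command line might then look like the following sketch (the process count and executable name are illustrative):

mpiexec -genv MKL_MIC_ENABLE 1 -np 32 ./testprogram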

Batch jobs

For production runs it is necessary to run the MPI program as a batch job.

Environment variables

Some environment variables may give information about software that does not come directly from the operating system distribution. Examples are some build tools like ant/maven, which are provided.

17.1.5.4. Multithreaded (OpenMP or hybrid MPI/OpenMP) applications

To compile and link OpenMP applications pass the flag -qopenmp (which replaces the former option -openmp, now deprecated) to the Intel compiler, or -fopenmp to the gcc compiler.

In some cases it is necessary to increase the private stack size of the threads at runtime, e.g. when threaded applications exit with segmentation faults. On Avitohol the thread private stack size is set via the environment variable KMP_STACKSIZE. The default value is 4 megabytes (4MB). For example, to request a stack size of 128 megabytes on Avitohol, set KMP_STACKSIZE=128m.

Beware that on the coprocessor the practical limit for the number of threads is 244 (61 cores times 4 hardware threads per core). Thus using high values for KMP_STACKSIZE may crash the application.

Note

Applications occasionally may use swap memory. In general such application behavior is undesirable and strongly discouraged. It is presumed to be a result of application mis-configuration or error in the input. Users are expected to cancel jobs that they observe to be swapping and try and correct the errors. If they believe that this is normal behavior, they should discuss this with system administrators.

For information on compiling applications which use pthreads please consult the Intel compiler documentation.

17.1.5.5. Using MKL mathematical library on Avitohol

The Intel Math Kernel Library (MKL) is provided on Avitohol. MKL provides highly optimized implementations of (among others)

• LAPACK/BLAS routines,

• direct and iterative solvers,


• FFT routines,

• ScaLAPACK.

Parts of the library support thread or distributed memory parallelism.

By setting the environment variable MKL_MIC_ENABLE to 1, e.g., by

export MKL_MIC_ENABLE=1

in a bash shell, one can enable the automatic offload of some library routines to the available Xeon Phi coprocessors. For using this option with MPI, see above.

Note:

Extensive information on the features and the usage of MKL is provided by the official Intel MKL documentation

For details about automatic offload one can consult: https://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf

Linking programs with MKL

By default, an MKL environment module is already loaded. The module sets the environment variables MKL_HOME and MKLROOT to the installation directory of MKL. These variables can then be used in makefiles and scripts.

The Intel MKL Link Line Advisor is often useful to obtain information on how to link programs with MKL. For example, to link statically with the threaded version of MKL on Avitohol (Linux, Intel64) using standard 32 bit integers, something like the following must be added to the Makefile when invoking the Intel compiler:

-Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a $(MKLROOT)/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -qopenmp

(in one line). If compiling directly in bash, one must use ${MKLROOT} instead of $(MKLROOT).
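As an illustration only (the source file name is hypothetical, and the exact library list should always be checked with the Link Line Advisor), a direct compilation in bash could look like:

icc -O2 -qopenmp mycode.c -o mycode \
  -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a \
  ${MKLROOT}/lib/intel64/libmkl_intel_thread.a \
  ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm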

17.1.5.6. Numerical Libraries

FFTW

The so-called "Fastest Fourier transforms in the West/World" software is installed. Currently the version of FFTW 3 that comes with the OS is 3.2.3 and is available in the standard locations for development software.

PETSc

A suite of data structures and routines for the scalable (MPI parallel) solution of scientific applications modeled by partial differential equations. Available as a module.

SLEPc

Library for the solution of large scale sparse eigenvalue problems on parallel computers. Available as a module.

GSL

The GNU Scientific Library (GSL) is a numerical library for C and C++ programmers (the FGSL Fortran add-on interface is installed). GSL provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. Available as part of the OS distribution.

Python modules


The Python version from the OS distribution is of the 2.6 series. The latest 2.7 series version is available as a module. The tensorflow module is available as part of the 2.7 installation.

17.1.5.7. I/O Libraries

NetCDF

NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. Available as a module (version 4).

17.1.5.8. Miscellaneous Libraries

• Boost: The Boost C++ libraries provide many classes and routines for various applications. The version of boost provided by the OS is rather old. Newer versions are available under /opt/soft/ and are available through the environmental modules system.

• TBB: Intel Threading Building Blocks. TBB enables the C++ programmer to integrate (shared memory) parallel capability into the code. Available as part of the Intel Compiler Suite.

• IPP: Intel Integrated Performance Primitives (IPP). The IPP API contains highly optimized primitive operations used for digital filtering, audio and image processing.

• PAPI: The Performance Application Programming Interface is a library for reading performance event counters in a portable way.

17.1.6. Programming Tools

To access the software described below please use the module command (module avail, module load).

17.1.6.1. Debugging

• Compiler options: Compilers usually have some debugging features which allow, for example, checking violations of array boundaries. Please consult the compiler's manual pages and documentation for details.

• gdb, the GNU debugger, enables the debugging of threaded applications.

• If debugging features will not be used, the executable can be stripped of debugging information by running strip.

17.1.6.2. Profiling and Performance Analysis

1. Intel VTune/Amplifier is a powerful tool for analyzing the single core performance of a code.

2. gprof, the GNU profiler.

3. Intel Trace Analyzer and Collector is a tool for profiling MPI communication.

4. Scalasca enables the analysis of MPI/OpenMP/hybrid codes.

5. Optimised executables can be obtained by first compiling for profile generation (option -prof-gen for Intel, -fprofile-generate for gcc), then running the executable, and finally recompiling with the generated profile (option -prof-use for Intel, -fprofile-use for gcc), as sketched below.
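A minimal sketch of this two-pass build with the Intel compiler (the source file name is illustrative):

icc -prof-gen -O2 mycode.c -o mycode    # instrumented build
./mycode                                # run a representative workload to collect the profile
icc -prof-use -O2 mycode.c -o mycode    # rebuild using the collected profile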

Further reading

For more information related to “Programming Tools” please refer to the Avitohol website: http://www.hpc.acad.bg/system-1/

17.1.7. Final remarks

For updates please check http://www.hpc.acad.bg


Users that require additional tools or software to be installed or upgrades to existing software should contact the support team at [email protected].

17.2. MareNostrum @ BSC

17.2.1. System Architecture / Configuration

The cluster consists of one iDataPlex rack with 42 dx360 M4 nodes. Each node contains 2 Sandy Bridge-EP host processors E5-2670 @ 2.6 GHz (2 x 8 cores) and 2 Intel Xeon Phi (MIC) 5110P coprocessors with 60 cores @ 1.1 GHz. The memory per node is 64 GB (host) and 2 x 8 GB (Intel Xeon Phi). The compute nodes are connected via Mellanox InfiniBand FDR10 using Mellanox OFED 2.2. Through a virtual bridge interface all MIC coprocessors can be directly accessed from the host compute nodes. The Intel Xeon compute nodes are housed in compute rack c37 and named s18r3b01, …, s18r3b42. The 2 MIC coprocessors attached to a compute node via PCIe are named (e.g. for the host s18r3b01) s18r3b01-mic0 and s18r3b01-mic1. A Mellanox ConnectX-3 dual-port FDR10 QSFP IB mezzanine card is attached to CPU socket 0. Linux runs as the operating system on both the compute hosts (SLES 11 SP3) and the Intel Xeon Phi coprocessors (Intel MIC Platform Software Stack (Built by Poky 7.0) 3.4.3, based on Busybox). Every compute node has a 455 GB local disk attached to it, which is also mounted on both Xeon Phi coprocessors.

17.2.2. System Access

The Intel Xeon Phi partition in MareNostrum can be accessed only by users who already have a MareNostrum user account. After you are registered, use ssh to connect to MareNostrum's login nodes via: ssh [email protected]. The login nodes should only be used for editing, compiling and submitting programs. Jobs must be submitted via LSF. Mind that compilation on the compute nodes and coprocessors is not possible (no include files installed etc.). Login to the compute nodes is done via an interactive job. Login to the corresponding MIC coprocessors is passwordless:

ssh $HOST-mic0
ssh $HOST-mic1

17.2.3. Production Environment

17.2.3.1. Budget and quota

There is a budget system enabled for PRACE users in the Intel Xeon Phi partition.

17.2.3.2. Filesystems

The GPFS user home directories under /home/group/user are only mounted on the host compute nodes. The Xeon Phi partition has its own home directory, which is local to the MIC and mounted in RAM memory. The local disk of the host compute node is mounted on both associated MICs: /scratch (455 GB) for user scratch files (read + write). Its content is cleaned after the job finishes.

Mind that the I/O performance of the mounted /home filesystem is quite low. In the case of MPI, the MPI executable and necessary input files should preferably be copied explicitly to the scratch partitions of the local disks.

17.2.3.3. Batch Mode

Batch jobs must use the LSF class "mic". This is currently the only available class, with a runtime limit of 48h. In the batch file the number of compute nodes (i.e. the number of hosts, not the number of Intel Xeon Phi coprocessors) must be specified. A sample job file allocating 1 compute node and the 2 Intel Xeon Phi coprocessors attached to that host is shown below:

#BSUB -n 16
#BSUB -R"span[ptile=16]"
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -J Test
#BSUB -W 02:00
#BSUB -q mic


program-doing-offloading

Submitting batch jobs:

bsub < jobscript
Job <1874136> is submitted to queue

Showing the queue:

bjobs
JOBID    USER      STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
1874136  username  RUN   mic    login1                Test      May 31 15:43

Cancelling a job:

bkill 1874136
Job <1874136> is being terminated

17.2.3.4. Interactive Jobs

To interactively work with the system submit a simple batch command to the mic-interactive queue:

bsub -Is -q mic-interactive -W 01:00 /bin/bash

It will open a bash session in a Xeon Phi host for an hour.

17.2.3.5. Running Applications

Applications can be run in two modes.

17.2.3.5.1. Native Mode

Run an interactive job and copy the binary compiled for MIC to the coprocessor using scp, or copy it to the local filesystem /scratch/. Log in to the Intel Xeon Phi coprocessor to get a shell and execute the binary.

17.2.3.5.2. Mixed Mode

The wrapper script “bsc_micwrapper” has been developed to ease the use of MPI across different hosts and their MICs. Usage:

Parameters:
$1 : Number of MPI processes to execute on each host
$2 : Number of MPI processes to execute on each mic of each host
$3 : binary file *

* Since the host and mic binaries are different, you must have both the host binary and its .mic counterpart in the same folder somewhere under the /gpfs/scratch filesystem. Example:

#!/bin/bash
#BSUB -n 64
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -J Test
#BSUB -W 02:00
#BSUB -q mic


# The job will be allocated in 4 Xeon Phi nodes
# this will submit 16 MPI tasks in each node (host) and
# 240 MPI tasks in each mic (mic0 and mic1) of the 4 nodes allocated
# note that both
# /gpfs/scratch/xxx/$USER/MIC/test_MPI_impi
# /gpfs/scratch/xxx/$USER/MIC/test_MPI_impi.mic
# must exist

bsc_micwrapper 16 240 /gpfs/scratch/xxx/$USER/MIC/test_MPI_impi

17.2.4. Programming Environment

The following software stack is currently installed on MareNostrum:

• MPSS 3.4.3

• MLNX_OFED.2.2.125

• Intel Parallel Studio XE 2017 beta compilers

• Intel MPI

• Intel MKL

• Intel VTune Amplifier

The following modules are loaded by default:

module list
Currently Loaded Modulefiles:
  1) intel/13.0.1   2) openmpi/1.8.1   3) transfer/1.0   4) bsc/current

17.2.5. Tuning

Depending on the mode in which MPI programs use the MICs, the variables I_MPI_DAPL_PROVIDER_LIST, I_MPI_ADJUST_BARRIER, I_MPI_MIC, I_MPI_HYDRA_BOOTSTRAP, I_MPI_DEVICE and I_MPI_DYNAMIC_CONNECTION must be tuned.
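A minimal sketch of such a setup for a symmetric host+MIC run over DAPL; the provider names are typical values from Intel's documentation and are not verified site settings:

export I_MPI_MIC=enable                                        # allow Intel MPI to place ranks on the coprocessors
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0  # host fabric provider first, then the MIC SCIF provider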

17.2.6. Debugging

The GNU Debugger gdb is also installed on the coprocessors under /bin/gdb.

17.3. Salomon @ IT4Innovations

17.3.1. System Architecture / Configuration

The Salomon cluster consists of 1008 compute nodes, totaling 24192 compute cores with 129 TB RAM and giving over 2 Pflop/s theoretical peak performance. Each node is a powerful x86-64 computer equipped with 24 cores and at least 128 GB RAM. Nodes are interconnected by a 7D enhanced hypercube InfiniBand network and equipped with Intel Xeon E5-2680v3 processors. The Salomon cluster consists of 576 nodes without accelerators and 432 nodes equipped with Intel Xeon Phi MIC accelerators.

The cluster runs CentOS Linux operating system, which is compatible with the RedHat Linux family.


Figure 43. iDataPlex rack of SuperMIC

Table 2. System Summary

In general
  Primary purpose: High Performance Computing
  Architecture of compute nodes: x86-64
  Operating system: CentOS 6.7 Linux
Compute nodes
  Totally: 1008
  Processor: 2x Intel Xeon E5-2680v3, 2.5GHz, 12 cores
  RAM: 128GB, 5.3GB per core, DDR4@2133 MHz
  Local disk drive: no
  Compute network / Topology: InfiniBand FDR56 / 7D Enhanced hypercube
  w/o accelerator: 576
  MIC accelerated: 432
In total
  Total theoretical peak performance (Rpeak): 2011 Tflop/s
  Total amount of RAM: 129.024 TB

Table 3. Compute nodes

Node             Count  Processor                        Cores  Memory  Accelerator
w/o accelerator  576    2x Intel Xeon E5-2680v3, 2.5GHz  24     128GB   -
MIC accelerated  432    2x Intel Xeon E5-2680v3, 2.5GHz  24     128GB   2x Intel Xeon Phi 7120P, 61 cores, 16GB RAM

For remote visualization two nodes with NICE DCV software are available each configured:

Table 4. Remote visualization nodes

Node           Count  Processor                                   Memory  GPU Accelerator
visualization  2      2x Intel Xeon E5-2695v3, 2.3GHz (28 cores)  512GB   NVIDIA QUADRO K5000, 4GB RAM


For large memory computations a special SMP/NUMA SGI UV 2000 server is available:

Table 5. SGI UV 2000

Node     Count  Processor                                   Cores  Memory               Extra HW
UV 2000  1      14x Intel Xeon E5-4627v2, 3.3GHz, 8 cores   112    3328GB DDR3@1866MHz  2x 400GB local SSD, 1x NVIDIA GM200 (GeForce GTX X), 12GB RAM

17.3.2. System Access

17.3.2.1. Applying for Resources

Computational resources may be allocated by any of the following computing resources allocation mechanisms. Academic researchers can apply for computational resources via Open Access Competitions. Anyone is welcome to apply via the Director's Discretion. Foreign (mostly European) users can obtain computational resources via the PRACE (DECI) programme. In all cases, IT4Innovations’ access mechanisms are aimed at distributing computational resources while taking into account the development and application of supercomputing methods and their benefits and usefulness for society. The applicants are expected to submit a proposal. In the proposal, the applicants apply for a particular amount of core-hours of computational resources. The requested core-hours should be substantiated by the scientific excellence of the proposal, its computational maturity and expected impacts. Proposals undergo a scientific, technical and economic evaluation. The allocation decisions are based on this evaluation.

17.3.2.2. Shell access and data transfer

The Salomon cluster is accessed by the SSH protocol via the login nodes login1, login2, login3 and login4 at the address salomon.it4i.cz. The login nodes may be addressed specifically by prepending the login node name to the address, e.g. login1.salomon.it4i.cz. The address salomon.it4i.cz is a round-robin DNS alias for the four login nodes.

17.3.3.1. Application Modules

In order to configure your shell for running a particular application on Salomon we use the Modules package interface.

Application modules on Salomon cluster are built using EasyBuild. The modules are divided into the following structure:

base: Default module class
bio: Bioinformatics, biology and biomedical
cae: Computer Aided Engineering (incl. CFD)
chem: Chemistry, Computational Chemistry and Quantum Chemistry
compiler: Compilers
data: Data management & processing tools
debugger: Debuggers
devel: Development tools
geo: Earth Sciences
ide: Integrated Development Environments (e.g. editors)
lang: Languages and programming aids
lib: General purpose libraries
math: High-level mathematical software
mpi: MPI stacks
numlib: Numerical Libraries
perf: Performance tools
phys: Physics and physical systems simulations
system: System utilities (e.g. highly depending on system OS and hardware)
toolchain: EasyBuild toolchains
tools: General purpose tools
vis: Visualization, plotting, documentation and typesetting

The modules set up the application paths, library paths and environment variables for running a particular application. The modules may be loaded, unloaded and switched according to momentary needs.

To check available modules use:

$ module avail

To load a module, for example the OpenMPI module use:

$ module load OpenMPI

Loading the OpenMPI module will set up the paths and environment variables of your active shell such that you are ready to run the OpenMPI software.

To check loaded modules use:

$ module list

To unload a module, for example the OpenMPI module use:

$ module unload OpenMPI

17.3.4. Programming Environment

17.3.4.1. EasyBuild Toolchains

As we wrote earlier, we use EasyBuild for automated software installation and module creation. EasyBuild employs so-called compiler toolchains or, simply, toolchains for short, which are a major concept in handling the build and installation processes. A typical toolchain consists of one or more compilers, usually put together with some libraries for specific functionality, e.g., an MPI stack for distributed computing, or libraries which provide optimized routines for commonly used math operations, e.g., the well-known BLAS/LAPACK for linear algebra routines. For each software package being built, the toolchain to be used must be specified in some way. The EasyBuild framework prepares the build environment for the different toolchain components by loading their respective modules and defining environment variables to specify compiler commands (e.g., via $F90), compiler and linker options (e.g., via $CFLAGS and $LDFLAGS), the list of library names to supply to the linker (via $LIBS), etc. This enables making easyblocks largely toolchain-agnostic, since they can simply rely on these environment variables; that is, unless they need to be aware of, for example, the particular compiler being used to determine the build configuration options. Recent releases of EasyBuild include out-of-the-box toolchain support for:

• various compilers, including GCC, Intel, Clang, CUDA
• common MPI libraries, such as Intel MPI, MPICH, MVAPICH2, OpenMPI
• various numerical libraries, including ATLAS, Intel MKL, OpenBLAS, ScaLAPACK, FFTW

On Salomon, we currently have the following toolchains installed:

Toolchain Module(s)


• GCC GCC

• ictce icc, ifort, imkl, impi

• intel GCC, icc, ifort, imkl, impi

• gompi GCC, OpenMPI

• goolf BLACS, FFTW, GCC, OpenBLAS, OpenMPI, ScaLAPACK

• iompi OpenMPI, icc, ifort

• iccifort icc, ifort

17.4. SuperMIC @ LRZ

17.4.1. System Architecture / Configuration

The cluster consists of one iDataPlex rack with 32 dx360 M4 nodes. Each node contains 2 Ivy Bridge host processors E5-2650 @ 2.6 GHz (2 x 8 cores) and 2 Intel Xeon Phi (MIC) 5110P coprocessors with 60 cores @ 1.1 GHz.

Memory per node is 64 GB (host) and 2 x 8 GB (Intel Xeon Phi).

The following picture shows the iDataPlex rack of SuperMIC.

Figure 44. iDataPlex rack of SuperMIC

Intel's Xeon Phi coprocessor is based on x86 technology and consists of 60 cores interconnected by a bidirectional ring interconnect. Each core can run up to 4 hardware threads, so max. 240 hardware threads are supported per coprocessor.


The connection from the host to the attached coprocessors is via PCIe 2.0, which limits the bandwidth to 6.2 GB/s.

The compute nodes are connected via Mellanox InfiniBand FDR14 using Mellanox OFED 2.2. Through a virtual bridge interface all MIC coprocessors can be directly accessed from the SuperMIC login node and the compute nodes.

The SuperMIC login node supermic.smuc.lrz.de (=login12) can be accessed from the supermuc login nodes and user-registered IP addresses.

The Intel Xeon compute nodes in rack column A are named i01r13a01, i01r13a02, ..., i01r13a16, those in rack column B are named i01r13c01, i01r13c02, ..., i01r13c16.

The 2 MIC coprocessors attached to a compute node using PCIe are named (e.g. for the host i01r13a01) i01r13a01-mic0 and i01r13a01-mic1.

2 MLNX CX3 FDR PCIe cards (riser cards) are attached to each CPU socket.

Linux runs as the operating system on both the compute hosts and the Intel Xeon Phi coprocessors.

Every compute node has a local disk attached to it.

An overview of the various nodes and coprocessors is shown in the following picture.

Figure 45. SuperMIC configuration

17.4.2. System Access

SuperMIC can be accessed only by users who already have a SuperMUC user account with a registered IP address. Please apply for access by contacting the LRZ HPC support, providing your name, user ID and your static IP address. After you are registered, use ssh to connect to the SuperMIC login node via: ssh supermic.smuc.lrz.de

The SuperMIC login node (supermic = login12) should only be used for editing, compiling and submitting programs. Compilation is also possible on the SuperMUC login nodes.

Jobs must be submitted via LoadLeveler. Once you get the reservation from the batch system, also interactive login to the reserved compute nodes and the attached MIC coprocessors is possible.

Mind that compilation on the compute nodes and coprocessors is not possible (no include files installed etc.).

Login to compute nodes:

ssh i01r13???

Login to the corresponding MIC coprocessors:

ssh i01r13???-mic0
ssh i01r13???-mic1

To be able to log in to the coprocessors you have to create ssh keys using the command "ssh-keygen". These keys must be named id_rsa and id_rsa.pub and must reside within your $HOME/.ssh directory _before_ you submit a SuperMIC job.

The content of id_rsa.pub has to be appended to the file $HOME/.ssh/authorized_keys to access also the compute nodes.

It is then possible to log in to the MIC coprocessors from the reserved compute nodes or directly from the SuperMIC login node.

17.4.3. Production Environment

17.4.3.1. Budget and Quota

There are no quotas for computing time on the SuperMIC cluster yet.

17.4.3.2. Filesystems

The user home directories under /home/hpc/group/user and the LRZ software repositories under /lrz/sys/ (same as on SuperMUC) and /lrz/mic/ (MIC specific software and modules) are mounted on the SuperMIC login-node, compute nodes and the MIC coprocessors.

To offer a clean environment on the MICs, the default home-directory on the MICs is a local directory /home/user.

The local disks have 2 partitions:

/scratch (410 GB) for user scratch files (read + write). Content is cleaned after the job finishes.

/lrzmic (50 GB) for MIC specific software (read-only), maintained by LRZ. Deprecated. MIC-specific software is now mounted globally under /lrz/mic.

Both partitions are mounted on the compute host and the 2 attached Intel Xeon Phi coprocessors.


Mind that the I/O performance of the mounted /home/hpc filesystem is quite low. In the case of MPI, it is better to copy the MPI executable and necessary input files explicitly to the scratch partitions of the local disks attached to the allocated compute nodes.

17.4.3.3. Interactive and Batch Jobs

17.4.3.3.1. Batch Mode

Batch jobs must use the LoadLeveler class "phi". This is currently the only available class; its runtime limit is 48 h. In the batch file the number of compute nodes (i.e. the number of hosts, not the number of Intel Xeon Phi coprocessors) must be specified.

To offer a clean system environment the Xeon Phi coprocessors are reset and rebooted after every job.

Example script: Sample job file allocating 1 compute node and the 2 Intel Xeon Phi coprocessors attached to that host:

#!/bin/bash
#@ wall_clock_limit = 01:00:00
#@ job_name = test
#@ job_type = parallel
#@ class = phi
#@ node = 1
#@ node_usage = not_shared
#@ initialdir = $(home)/jobs/
#@ output = test-$(jobid).out
#@ error = test-$(jobid).out
#@ notification=always
#@ [email protected]
#@ queue

. /etc/profile
. /etc/profile.d/modules.sh
program-doing-offloading   # for noninteractive execution
sleep 10000                # for interactive access while the job is running

Submitting batch jobs:

lu65fok@login12:~/jobs> llsubmit job1.ll
llsubmit: Processed command file through Submit Filter: "/lrz/loadl/SuperMIC/filter/submit_filter.pl".
llsubmit: The job "i01xcat3-ib.287" has been submitted.

Showing the queue:

lu65fok@login12:~/jobs> llq
Id                 Owner    Submitted   ST PRI Class  Running On
------------------ -------- ----------- -- --- ------ ------------
i01xcat3-ib.287.0  lu65fok  6/4 16:51   ST 50  phi    i01r13c11-ib

1 job step(s) in queue, 0 waiting, 1 pending, 0 running, 0 held, 0 preempted

Cancelling a job:

lu65fok@login12:~/jobs> llcancel i01xcat3-ib.288.0
llcancel: Cancel command has been sent to the central manager.


Due to the rebooting of the cards, cancellation of a job takes some time.

17.4.3.3.2. Interactive Jobs

To work interactively with the system, submit a simple batch file and insert a "sleep 10000" command at the very end of the script.

During the sleeping period the allocated compute nodes and Intel Xeon Phi coprocessors can be accessed interactively.

17.4.3.4. Running Applications

17.4.3.4.1. Native Mode

Run an interactive job (using sleep within a job script) and copy the binary compiled for MIC to the coprocessor using scp, or copy it to the local filesystem /scratch/. Log in to the Intel Xeon Phi coprocessor to get a shell and execute the binary.

For simple programs that do not need any input files, native binaries can also be launched from the host using: micnativeloadex micbinary

17.4.3.4.2. MPI Programs

MPI tasks can be run on the host and/or on the Xeon Phi coprocessors. The MPI code can contain Offload Pragmas for offloading to the coprocessors and/or OpenMP to implement multithreading.

Only Intel MPI is supported. Since Intel MPI tasks are launched via ssh on remote nodes / coprocessors (default on SuperMIC: I_MPI_HYDRA_BOOTSTRAP=ssh), users have to create passwordless ssh-keys named id_rsa and id_rsa.pub using "ssh-keygen". The content of id_rsa.pub has to be appended to the file $HOME/.ssh/authorized_keys. It is not allowed to copy the keys to other systems outside of SuperMUC/SuperMIC.

An overview of the various MPI/OpenMP modes is presented in the following picture.

17.4.3.4.3. MPI Programs: Offload Mode

Sample MPI program doing offloading:

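A minimal sketch of such a program is shown below; it is not the exact code used for the run shown later, and the rank-to-coprocessor mapping via rank % 2 as well as all identifiers are illustrative. Each MPI rank prints a greeting from its host and then offloads a small region that reports the coprocessor's hostname and logical core count:

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, size, miccores = 0;
    char host[256], michost[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(host, sizeof(host));
    printf("Hello world from process %d of %d: host: %s\n", rank, size, host);

    /* With 2 MPI processes per node, rank % 2 selects a different
       coprocessor (mic0 or mic1) for each process on that node. */
    #pragma offload target(mic: rank % 2) out(miccores) out(michost)
    {
        miccores = omp_get_num_procs();
        gethostname(michost, sizeof(michost));
    }
    printf("MIC: I am %s and I have %d logical cores. I was called by process %d of %d: host: %s\n",
           michost, miccores, rank, size, host);

    MPI_Finalize();
    return 0;
}

Such a program would be built with the Intel MPI compiler wrapper, e.g. mpiicc -qopenmp -o testmpioffload testmpioffload.c, so that the binary name matches the mpiexec call in the jobscript below.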

Sample jobscript allocating 2 nodes and running 2 MPI processes on every compute node. Each MPI process offloads to a different MIC coprocessor.

#!/bin/bash
#@ wall_clock_limit = 01:00:00
#@ job_name = test
#@ job_type = parallel
#@ class = phi
#@ node = 2
#@ tasks_per_node = 2
#@ node_usage = not_shared
#@ initialdir = $(home)/jobs/
#@ output = test-$(jobid).out
#@ error = test-$(jobid).err
#@ notification=always
#@ [email protected]
#@ queue

. /etc/profile
. /etc/profile.d/modules.sh
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u
mpiexec ./testmpioffload

Output:

Hello world from process 2 of 4: host: i01r13a06
Hello world from process 1 of 4: host: i01r13a07
Hello world from process 3 of 4: host: i01r13a06
Hello world from process 0 of 4: host: i01r13a07
MIC: I am i01r13a07-mic1 and I have 240 logical cores. I was called by process 1 of 4: host: i01r13a07
MIC: I am i01r13a06-mic1 and I have 240 logical cores. I was called by process 3 of 4: host: i01r13a06
MIC: I am i01r13a07-mic0 and I have 240 logical cores. I was called by process 0 of 4: host: i01r13a07
MIC: I am i01r13a06-mic0 and I have 240 logical cores. I was called by process 2 of 4: host: i01r13a06

17.4.3.4.4. MPI Programs: Native Mode

To run MPI tasks natively on the Intel Xeon Phi coprocessors, I_MPI_MIC must be enabled (default on SuperMIC: I_MPI_MIC=enable). The preferred fabric is InfiniBand via dapl (default on SuperMIC: I_MPI_FABRICS=shm:dapl). The preferred value of I_MPI_DAPL_PROVIDER_LIST depends on the MPI mode used (tasks on host and MIC vs. host and host, etc.). Currently all MPI modes work with the setting I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u, but this may not deliver maximum bandwidth.

In some cases I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0 delivers better performance.

Below we show a sample jobscript allocating two nodes and running 1 MPI process per host and per MIC coprocessor.

The binary $bin_mic must have been compiled natively for MIC using "-mmic". It is copied to the local /scratch/ directory on every allocated compute-node by the script.

The binary $bin_host must have been compiled for Intel Xeon.

$taskspermic and $tasksperhost should be properly set (in the example both are set to 1).

#!/bin/bash
#@ wall_clock_limit = 01:00:00
#@ job_name = TEST2
#@ job_type = parallel
#@ class = phi
#@ node = 2
#@ node_usage = not_shared
#@ initialdir = $(home)/jobs/
#@ output = test-$(jobid).out
#@ error = test-$(jobid).err
#@ notification=always
#@ [email protected]
#@ queue

. /etc/profile
. /etc/profile.d/modules.sh
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u

bin_mic=testmpi-mic
dir_mic=/scratch
bin_host=./testmpi-host
arg=""
taskspermic=1
tasksperhost=1
command="mpiexec"
for i in `cat $LOADL_HOSTFILE`; do
  numhosts=$((numhosts+1))
  host=`echo $i | cut -d- -f1`
  hosts="$hosts $host"
  mic0=$host-mic0
  mic1=$host-mic1
  mics="$mics $mic0 $mic1"
  scp $bin_mic $host:/$dir_mic

  command="$command -host $host-ib -n $tasksperhost $bin_host $arg : "
  command="$command -host $mic0 -n $taskspermic $dir_mic/$bin_mic $arg : -host $mic1 -n $taskspermic $dir_mic/$bin_mic $arg"

  if test $numhosts -lt $LOADL_TOTAL_TASKS; then
    command="$command : "
  fi
done
echo Hosts = $hosts
echo MICS = $mics
echo $command
$command

Output:

Hosts = 01r13c16 i01r13a02
MICS = i01r13c16-mic0 i01r13c16-mic1 i01r13a02-mic0 i01r13a02-mic1
mpiexec -host i01r13c16-ib -n 1 ./testmpi-host : -host i01r13c16-mic0 -n 1 /scratch/testmpi-mic : -host i01r13c16-mic1 -n 1 /scratch/testmpi-mic : -host i01r13a02-ib -n 1 ./testmpi-host : -host i01r13a02-mic0 -n 1 /scratch/testmpi-mic : -host i01r13a02-mic1 -n 1 /scratch/testmpi-mic

Hello world from process 3 of 6: host: i01r13a02
Hello world from process 0 of 6: host: i01r13c16
Hello world from process 5 of 6: host: i01r13a02-mic1
Hello world from process 4 of 6: host: i01r13a02-mic0
Hello world from process 2 of 6: host: i01r13c16-mic1
Hello world from process 1 of 6: host: i01r13c16-mic0

17.4.4. Programming Environment

The following software stack is currently installed on SuperMIC:

• MPSS 3.4


• MLNX_OFED_LINUX-2.2-1.0.1.1

• Intel Parallel Studio XE 2016 compilers

• Intel MPI

• Intel MKL, IPP, TBB (module load ipp tbb)

• Intel VTune Amplifier (module load amplifier_xe/2016)

The following modules are loaded per default:

lu65fok@login12:~> module list
Currently Loaded Modulefiles:
 1) admin/1.0     3) intel/16.0   5) mpi.intel/5.1
 2) tempdir/1.0   4) mkl/11.3     6) lrz/default
lu65fok@login12:~>

There is a small module system installed also on the coprocessors.

Currently the following modules are available on the Xeon Phi coprocessors:

lu65fok@i01r13c07-mic0:~$ module av
------------ /lrz/mic/share/modules/files/tools ------------
likwid/3.1
------------ /lrz/mic/share/modules/files/compilers ------------
ccomp/intel/14.0             fortran/intel/15.0(default)
ccomp/intel/15.0(default)    fortran/intel/16.0
ccomp/intel/16.0             intel/16.0
fortran/intel/14.0
------------ /lrz/mic/share/modules/files/libraries ------------
mkl/11.1   mkl/11.2   mkl/11.3(default)   tbb/4.4
------------ /lrz/mic/share/modules/files/parallel ------------
mpi.intel/4.1   mpi.intel/5.0   mpi.intel/5.1(default)
lu65fok@i01r13c07-mic0:~$

17.4.5. Performance Analysis

17.4.5.1. Likwid

Likwid (Lightweight performance tools) can be loaded interactively on the coprocessors using

module load likwid

For more details see the likwid home page [https://code.google.com/archive/p/likwid/].

17.4.5.2. Intel Amplifier XE

The new Sampling Enabling Product (SEP) kernel modules are now loaded on the compute nodes and MIC coprocessors per default. Only use the module amplifier_xe/2016 and no older versions, since incompatible versions often cause the sep kernel module to hang.

Since a lot of output is produced, always use /scratch as the output directory. Write access to /home is very slow.

To collect data, use the command line version "amplxe-cl" on the COMPUTE NODES. To visualise the collected data, use the GUI version "amplxe" on the LOGIN NODE.


To detect hotspots on the MIC run something similar to the following:

# module load amplifier_xe/2016
# rm -rf /scratch/result
# amplxe-cl --collect advanced-hotspots -target-system=mic-native:`hostname`-mic0 -r /scratch/result -- /home/hpc/pr28fa/lu65fok/mic/program-for-mic

To use SEP directly, use a command like (use /scratch as output directory!)

# module load amplifier_xe/2016
# sep -start -mic -verbose -out /scratch/test -app /usr/bin/ssh -args "lu65fok@i01r13c01-mic0 /home/lu65fok/micapp"

to sample on the MIC. To convert tb6 files to the amplxe format and view them within Amplifier XE use (on the login node):

# amplxe-cl -import test.tb6 -r test
# amplxe-gui test

17.4.6. Tuning

When running MPI tasks on several hosts AND Xeon Phi coprocessors, several collective MPI functions like MPI_Barrier do not return properly (causing deadlocks).

In this case set, e.g.,

export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u
export I_MPI_ADJUST_BARRIER=1
export I_MPI_ADJUST_BCAST=1

More details can be found under https://software.intel.com/en-us/articles/intel-mpi-library-collective-optimization-on-intel-xeon-phi

To improve the performance of MPI_Put operations use: export I_MPI_SCALABLE_OPTIMIZATION=off

Make sure the following variables are not set for SuperMIC (these variables are currently set by intel.mpi for all systems except for SuperMIC):

unset I_MPI_HYDRA_BOOTSTRAP_EXEC (makes MPI jobs block at start-up)

unset I_MPI_HYDRA_BRANCH_COUNT (causes strange ssh authorization errors)

More details can be found under https://software.intel.com/en-us/node/528782

17.4.7. Debugging

The GNU debugger gdb is also installed on the coprocessors under /bin/gdb.

Further documentation

Books

[1] James Reinders, James Jeffers, Intel Xeon Phi Coprocessor High Performance Programming, Morgan Kaufman Publ Inc, 2013, http://lotsofcores.com.

[2] James Reinders, James Jeffers, Avinash Sodani, Intel Xeon Phi Processor High Performance Programming, Knights Landing Edition, Morgan Kaufman Publ Inc, 2016, http://lotsofcores.com.

[3] Rezaur Rahman: Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers, Apress 2013.

[4] Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, Colfax 2013, http://www.colfax-intl.com/nd/xeonphi/book.aspx.

[5] James Reinders, Jim Jeffers: High Performance Parallelism Pearls, Morgan Kaufman Publ Inc, 2015, http://www.lotsofcores.com/.

[6] James Reinders, Jim Jeffers: High Performance Parallelism Pearls Vol. 2, Morgan Kaufman Publ Inc, 2015, http://www.lotsofcores.com/.

[7] Michael McCool, James Reinders, Arch Robison, Structured Parallel Programming: Patterns for Efficient Computation, Morgan Kaufman Publ Inc, 2013, http://parallelbook.com.

[8] Barbara Chapman, Gabriele Jost and Ruud van der Pas, Using OpenMP, MIT Press Cambridge, 2007, http://mitpress.mit.edu/books/using-openmp.

Forums, Download Sites, Webinars

[9] Intel Developer Zone: Intel Xeon Phi Coprocessor, http://software.intel.com/en-us/mic-developer.

[10] Intel Many Integrated Core Architecture User Forum, http://software.intel.com/en-us/forums/intel-many-integrated-core.

[11] Intel Developer Zone: Intel Math Kernel Library, http://software.intel.com/en-us/forums/intel-math-kernel-library.

[12] Intel Math Kernel Library Link Line Advisor, http://software.intel.com/sites/products/mkl/.

[13] Intel Manycore Platform Software Stack (MPSS), http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss.

[14] Intel Xeon Processors & Intel Xeon Phi Coprocessors – Introduction to High Performance Applications Development for Multicore and Manycore – Live Webinar, 26.-27.2.2013, recorded, http://software.intel.com/en-us/articles/intel-xeon-phi-training-m-core.

[15] Intel Cilk Plus Home Page, http://cilkplus.org/.

[16] OpenMP forum, http://openmp.org/wp/.

[17] Intel Threading Building Blocks Documentation Site, http://threadingbuildingblocks.org/documentation.

Manuals, Papers

[18] Intel Xeon Phi Coprocessor Developer's Quick Start Guide, http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide.

[19] PRACE-1IP Whitepapers, Evaluations on Intel MIC, http://www.prace-ri.eu/Evaluation-Intel-MIC.

[20] Intel Xeon Phi Coprocessor (codename Knights Corner), http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner.

[21] Intel Xeon Phi Coprocessor System Software Developers Guide, https://secure-software.intel.com/sites/default/files/article/334766/intel-xeon-phi-systemsoftwaredevelopersguide.pdf.

[22] System Administration for the Intel Xeon Phi Coprocessor, http://software.intel.com/sites/default/files/article/373934/system-administration-for-the-intel-xeon-phi-coprocessor.pdf.

[23] Configuring Intel Xeon Phi coprocessors inside a cluster, http://software.intel.com/en-us/articles/configuring-intel-xeon-phi-coprocessors-inside-a-cluster.

[24] Using the Intel MPI Library on Intel Xeon Phi Coprocessor Systems, http://software.intel.com/en-us/articles/using-the-intel-mpi-library-on-intel-xeon-phi-coprocessor-systems.

[25] Debugging on Intel Xeon Phi Coprocessor Use Case Overview, http://software.intel.com/en-us/articles/debugging-on-intel-xeon-phi-coprocessor-use-case-overview.

[26] Intel Xeon Phi Coprocessor Instruction Set Reference Manual, http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf.

[27] An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors, http://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf.

[28] Programming and Compiling for Intel Many Integrated Core Architecture, http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture.

[29] Building a Native Application for Intel Xeon Phi Coprocessors, http://software.intel.com/en-us/articles/building-a-native-application-for-intel-xeon-phi-coprocessors.

[30] Intel Math Kernel Library on the Intel Xeon Phi Coprocessor, http://software.intel.com/en-us/articles/intel-mkl-on-the-intel-xeon-phi-coprocessors.

[31] Intel Software Documentation Library, http://software.intel.com/en-us/intel-software-technical-documentation.

[32] OpenCL Webpage, https://www.khronos.org/opencl/.


[33] OpenCL Design and Programming Guide for the Intel® Xeon Phi Coprocessor, http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor.

[34] OpenACC Webpage, http://www.openacc-standard.org/.

[35] Advanced Optimizations for Intel MIC Architecture, http://software.intel.com/en-us/articles/advanced-optimizations-for-intel-mic-architecture.

[36] Optimization and Performance Tuning for Intel Xeon Phi Coprocessors - Part 1: Optimization Essentials, http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization.

[37] Optimization and Performance Tuning for Intel Xeon Phi Coprocessors, Part 2: Understanding and Using Hardware Events, http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding.

[38] Requirements for Vectorizable Loops, http://software.intel.com/en-us/articles/requirements-for-vectorizable-loops/.

[39] Scalasca, http://www.scalasca.org/.

[40] Data Alignment to Assist Vectorization, http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization.

[41] VecAnalysis Python Script for Annotating Intel C++ and Fortran Compilers Vectorization Reports, http://software.intel.com/en-us/articles/vecanalysis-python-script-for-annotating-intelr-compiler-vectorization-report.

[42] OpenMP Thread Affinity Control, http://software.intel.com/en-us/articles/openmp-thread-affinity-control.

[43] Avoiding and Identifying False Sharing Among Threads, http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads.

[44] Intel Xeon Phi Core Micro-architecture, http://software.intel.com/en-us/articles/intel-xeon-phi-core-micro-architecture.

[45] Brian Wylie, Wolfgang Frings: Scalasca support for Intel Xeon Phi, XSEDE13, 22-25 July 2013, San Diego, https://www.xsede.org/documents/384387/561679/XSEDE13-Wylie.pdf.

[46] OpenMP Application Programming Interface Version 4.5 – November 2015, http://openmpcon.org.

[47] Application Programming Interface Examples Version 4.5.0 – November 2016, http://openmpcon.org.

[48] James Beyer, Bronis R. de Supinski, OpenMP Accelerator Model, IWOMP 2016 Tutorial.

[49] Simon McIntosh-Smith, Evaluating OpenMP's Effectiveness in the Many-Core Era, OpenMPCon'16 talk.

[50] OpenMP: Memory, Devices, and Tasks, 12th International Workshop on OpenMP, IWOMP 2016, Nara, Japan, October 5-7, 2016, Proceedings, Springer.

[51] Matt Martineau, James Price, Simon McIntosh-Smith, and Wayne Gaudin (Univ. of Bristol, UK Atomic Weapons Establishment), Pragmatic Performance Portability with OpenMP 4.x, published in [50].

[52] Dabov, K., Foi, A., Katkovnik, V. and Egiazarian, K., 2006, Image denoising with block-matching and 3D filtering, Proceedings of SPIE - The International Society for Optical Engineering.

[53] Dabov, K., Foi, A., Katkovnik, V. and Egiazarian, K., 2006, Color image denoising via sparse 3D collaborative filtering with grouping constraint in luminance-chrominance space, Proceedings - International Conference on Image Processing, ICIP, p. I313-I316.

[54] Dabov, K., Foi, A., Katkovnik, V. and Egiazarian, K., 2007, Image denoising by sparse 3-D transform-domain collaborative filtering, IEEE Transactions on Image Processing, p. 2080-2095.

[55] Lebrun, Marc, 2012, An Analysis and Implementation of the BM3D Image Denoising Method, Image Processing On Line, p. 175-213.
