Programming for the Xeon Phi

Michael Florian Hava, RISC Software GmbH

The Road to Xeon Phi

Intel Pentium (1993–1995)

• The Pentium was the first superscalar x86
  – No out-of-order execution!
  – Predates all SIMD extensions

• 1994: P54C
  – 75–100 MHz
  – The core design is reused in several research projects, including the Xeon Phi architecture!

Tera-Scale Computing (2006–)

• Research project to design a TFLOP CPU

• 2007: Teraflops Research Chip ("Polaris")
  – 80 cores (96-bit VLIW)
  – 1 TFLOP @ 63 W

• 2009: Single-chip Cloud Computer
  – 48 cores (x86)
  – No cache coherence

Larrabee (2009)

• A fully programmable GPGPU based on x86
  – Software renderer for OpenGL, DirectX, …

• 32–48 cores
  – 4-way Hyper-Threading
  – Cache coherence
  – 512-bit vector registers [LRBni]

• Planned product release: 2009–2010

Many Integrated Core (2010–)

• 2010: Knights Ferry (prototype)
  – 32 cores @ 1.2 GHz
  – 4-way Hyper-Threading
  – 512-bit vector registers [???]
  – 0.75 TFLOPS @ single precision

• 2012: Knights Corner [Xeon Phi]
  – 57–61 cores @ 1.0–1.2 GHz
  – 4-way Hyper-Threading
  – 512-bit vector registers [KNC]
  – 1 TFLOP @ double precision

Xeon Phi

• Calculating peak Xeon Phi FLOPS:
  – FLOPS = #cores × clock rate (GHz) × SIMD vector width × fp-ops per instruction
  – An FMA counts as 2 floating-point operations (and takes 1 cycle)

• SIMD vector width:
  – Single precision: 16 elements
  – Double precision: 8 elements

57 × 1.1 × 16 × 2 → ~2 TFLOPS (single precision)
57 × 1.1 × 8 × 2 → ~1 TFLOPS (double precision)
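The same arithmetic as a minimal C++ sketch; the 57-core, 1.1 GHz figures match the entry-level Knights Corner numbers above, other SKUs differ:

#include <cstdio>

int main() {
    const int    cores     = 57;  //entry-level Knights Corner (assumption, varies per SKU)
    const double clock_ghz = 1.1; //clock rate in GHz (assumption, varies per SKU)
    const int    fma_ops   = 2;   //FMA = multiply + add in one cycle
    const int    sp_width  = 16;  //512-bit registers / 32-bit floats
    const int    dp_width  = 8;   //512-bit registers / 64-bit doubles

    std::printf("peak single: %.1f GFLOPS\n", cores * clock_ghz * sp_width * fma_ops);
    std::printf("peak double: %.1f GFLOPS\n", cores * clock_ghz * dp_width * fma_ops);
}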

The Future

• 2015: Knights Landing
  – Up to 72 Airmont cores
  – 4-way Hyper-Threading
  – 512-bit vector registers [AVX-512]
    • Support for existing x86 extensions
  – 3 TFLOPS @ double precision
  – Available both as co-processor and as standalone CPU

Programming for the Xeon Phi

Supported Technologies

[Diagram: the identical technology stack is available on both the host CPU and the MIC – Tools, MKL, Fortran (CAF), TBB, OpenMP, C++, Cilk Plus, OpenCL. An executable runs natively on either side; the two are connected via PCIe, over which offloading takes place.]

Execution Models

• Native execution
  – Copy the cross-compiled executable (e.g. built with the Intel compiler's -mmic switch) to the Phi
  – Local Linux-based OS
    • Almost completely independent from the host system

• Offloading
  – Implicit/automatic
  – Explicit/manual

• Message passing (MPI)
  – Phi as cluster or node

Offloading

• Similar to the execution model of GPGPUs

• Technologies
  – OpenCL
  – Intel Offload Extension
  – OpenMP 4.0

• The Xeon Phi reserves one core in offload mode

Memory Models

• Explicit/distinct
  – Identical to the GPGPU model
  – Memory has to be copied
  – Limited to primitive types

• Implicit/virtual shared
  – Simulates a shared memory model
  – Supports complex data structures
  – Only available in C/C++ (see the sketch below)
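A minimal sketch of the implicit model, assuming the Intel compiler's proprietary offload extension (the _Cilk_shared/_Cilk_offload keywords); sum_data and the data layout are hypothetical:

//variables and functions marked _Cilk_shared live at the same virtual
//address on host and coprocessor; the runtime synchronizes the pages
_Cilk_shared int data[1000];

_Cilk_shared int sum_data() {
    int sum = 0;
    for(int i = 0; i < 1000; ++i)
        sum += data[i];
    return sum;
}

int main() {
    for(int i = 0; i < 1000; ++i)
        data[i] = i;                    //written on the host
    int sum = _Cilk_offload sum_data(); //executed on the Xeon Phi
    return sum == 999 * 1000 / 2 ? 0 : 1;
}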

OpenMP 4.0


• Released in July 2013

• Introduces several new concepts
  – Accelerators
  – SIMD
  – Thread teams

• Partially abandons the pure shared memory model
  – Accelerators contain local memory
  – Explicit memory & computation offloading (see the sketch below)
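A minimal sketch of how the three concepts combine; vector_add, a, b, c, and n are hypothetical, the combined construct is standard OpenMP 4.0:

void vector_add(const float * a, const float * b, float * c, int n) {
    //offload to the accelerator, spawn a league of thread teams,
    //distribute the iterations among them, and vectorize the loop
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp teams distribute parallel for simd
    for(int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}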

Tagging for Offload

• Types, functions, variables, etc. that should be available on the accelerator have to be tagged

#pragma omp declare target
//contains the last error of calculate
extern int last_error;

struct result_t { float sum, avg, min, max; };

result_t calculate(const float * array, int count);
#pragma omp end declare target

Offloading Computations and Data

• Offloading of computation and data is controlled via pragmas

float * array = new float[N];
std::generate_n(array, N, rand); //initialize array

//create device data context and start computation
result_t result;
#pragma omp target map(to: array[0:N], N)\
                   map(from: result, last_error)
result = calculate(array, N);

print(result, last_error);
delete[] array;

"Explicit" Device Data Context Management

#pragma omp target data map(to: N)
{ //device data context exists for this scope
    float * array = new float[N];
    std::generate_n(array, N, rand); //initialize array

    #pragma omp target update to(array[0:N])

    result_t result;
    #pragma omp target map(from: result)
    result = calculate(array, N);

    #pragma omp target update from(last_error)
    print(result, last_error);
    delete[] array;
}

Intel OpenMP Extensions (KMP)

Environment Variables (Xeon Phi)

• The Xeon Phi environment uses a custom prefix
  – MIC_ENV_PREFIX=##MIC##
  – Variables with this prefix are copied to the Xeon Phi at offload (with the prefix stripped)

• Examples
  – ##MIC##_OMP_NUM_THREADS
  – ##MIC##_KMP_DETERMINISTIC_REDUCTION
  – ##MIC##_KMP_AFFINITY

Deterministic OpenMP Reductions

• OpenMP does not specify the order in which the partial sums are combined!
  – Since floating-point addition is not associative, results are not reproducible across runs!

std::vector<float> arr = get_input();
float sum = 0;
#pragma omp parallel for reduction(+: sum)
for(std::size_t i = 0; i < arr.size(); ++i)
    sum += arr[i];

• KMP_DETERMINISTIC_REDUCTION=1
  – Ensures reproducible results (given the same number of threads)
  – Slight performance impact

Thread Affinity

• The distribution of OpenMP threads across cores is implementation-defined

• KMP_AFFINITY allows control of the distribution across sockets, CPUs, cores, and hyper-threads (see the probe below)
  – compact: place threads on the closest possible cores
  – scatter: distribute threads evenly among all cores
  – balanced: distribute threads evenly among all cores, but keep "close" threads on cores that are as close as possible
    • only available on the Xeon Phi
    • recommended mode
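To verify where the runtime actually placed the threads, a small probe along these lines can help; a minimal sketch assuming GNU/Linux, where sched_getcpu() reports the hardware thread the caller currently runs on:

#include <cstdio>
#include <omp.h>
#include <sched.h> //sched_getcpu, GNU/Linux only

int main() {
    //run with e.g. KMP_AFFINITY=compact, scatter, or balanced and compare
    #pragma omp parallel
    {
        std::printf("OpenMP thread %2d runs on hardware thread %2d\n",
                    omp_get_thread_num(), sched_getcpu());
    }
}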

Thread Affinity on a Multicore CPU

System: 4 cores, 2 hyper-threads per core, 4 OpenMP threads

           Core0      Core1      Core2      Core3
           HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1
compact    0    1     2    3     -    -     -    -
scatter    0    -     1    -     2    -     3    -

Thread Affinity on a Dual-Socket System

System: 2 sockets, 4 cores per socket, 2 hyper-threads per core, 8 OpenMP threads

           Socket0                                     Socket1
           Core0      Core1      Core2      Core3      Core0      Core1      Core2      Core3
           HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1
compact    0    1     2    3     4    5     6    7     -    -     -    -     -    -     -    -
scatter    0    -     4    -     2    -     6    -     1    -     5    -     3    -     7    -

Thread Affinity on the Xeon Phi

Phi: 6 cores, 4 hyper-threads per core, 12 OpenMP threads (A = 10, B = 11)

           Core0            Core1            Core2            Core3            Core4            Core5
           HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3
compact    0   1   2   3    4   5   6   7    8   9   A   B    -   -   -   -    -   -   -   -    -   -   -   -
scatter    0   6   -   -    1   7   -   -    2   8   -   -    3   9   -   -    4   A   -   -    5   B   -   -
balanced   0   1   -   -    2   3   -   -    4   5   -   -    6   7   -   -    8   9   -   -    A   B   -   -

Thank You!
(Photo: Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn; Valais, Switzerland)

www.risc-software.at
