Intel Xeon-Phi Coprocessor : Prog. Env
Total Page:16
File Type:pdf, Size:1020Kb
C-DAC Four Days Technology Workshop ON Hybrid Computing – Coprocessors/Accelerators Power-Aware Computing – Performance of Applications Kernels hyPACK-2013 Mode 3 : Intel Xeon Phi Coprocessors Lecture Topic : Intel Xeon-Phi Coprocessor : Prog. Env Venue : CMSD, UoHYD ; Date : October 15-18, 2013 C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 1 An Overview of Prog. Env on Intel Xeon-Phi Lecture Outline Following topics will be discussed Understanding of Intel Xeon-Phi Coprocessor Architecture Programming on Intel Xeon-Phi Coprocessor Performance Issues on Intel Xeon-Phi Coprocessor C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 2 Intel Xeon Phi - Coprocessor : Multi-to-May-Core Processors : Background C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 3 Programming paradigms-Challenges Large scale data Computing – Current trends HowTrends to run Computing Programs - Observationsfaster ? You require Super Computer Era of Single - Multi-to-Many Core - Heterogeneous Computing Sequential Computing How to run Programs faster ? Fetch/Store Fast Access of data Compute Fast Processor More Memory to Manage data C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 4 Dual Core Processor Conceptual diagram of a dual-core CPU, with CPU-local Level 1 caches, and Shared, on-chip Level 2 caches C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 5 Relationship among Processors, Processes, &Threads Processors Processors µPi Processes . Map to MMU OP1 OP2 OPn Map to Processors . Threads T1 T2 Tm µPi : Processor OP1 : Process T1 : Thread MMU : Main Memory Unit Source : Reference [4],[6], [7] C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 6 Multi-threaded Processing using Hyper-Threading Technology Time taken to process n threads on a single processor is significantly more than a single processor system with HT technology enabled. Multithreading Hyper-threading Technology App 0 App 1 App 2 App 0 App 1 App 2 T0 T1 T2 T3 T4 T5 T0 T1 T2 T3 T4 T5 Thread Pool Thread Pool CPU CPU CPU T0 T1 T2 T3 T4 T5 LP0 T0 T2 T4 LP1 Time T1 T3 T5 CPU 2 Threads per Processor Time Source : http://www.intel.com ; Reference : [6], [29], [31] P/C : Microprocessor and cache; SM : Shared memory C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 7 System View of Threads Threads Above the Operating System Understand the problems - Face using the threads – Runtime Environment Defining and Operating Executing Preparing Threads Threads Threads Performed by Performed by Performed by Programming OS using Processors Environment Processes and Compiler Showing return trip to represent that after execution operations get pass to user space Flow of Threads in an Execution Environment Source : Reference [4],[6], [7] C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 8 Architecture-Algorithm Co-Design 30 New instructions Cache improvements 25 HW thread scheduling 20 Baseline 15 10 5 0 Speedup vs. Best serial code vs. Best Speedup 1 2 4 8 16 32 64 Number of cores Source : http://www.intel.com C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 9 GPU Computing : Think in Parallel Performance = parallel hardware + scalable parallel program GPU Computing drives new applications Reducing “Time to Discovery” Application 100 x Speedup changes science & research methods CPU GPU New applications drive the future of GPUs Drives new GPU capabilities Drives hunger for more performance Source : NVIDIA, AMD,References C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 10 Programming paradigms-Challenges Large scale data Computing – Current trends Accelerators – GPU : Think in Parallel Speedups of 8 x to 30x are quite common for certain class of applications Best results when you “Think Data Parallel” Application Design your algorithm for data- parallelism CPU GPU Understand parallel algorithmic complexity and efficiency Use data-parallel algorithmic primitives as building blocks Source : NVIDIA, AMD, References C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 11 Accelerator Computing : Think in Parallel Accelerators –GPU : Think in Parallel Speedups of 8 x to 30x are quite common for certain class of applications The GPU is a data-parallel processor Thousands of parallel threads Thousands of data elements to process Application All data processed by the same program SPMD computation model CPU GPU Contrast with task parallelism and ILP Source : NVIDIA, AMD, References C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 12 Systems with Accelerators A set (one or more) of very simple execution units that can perform few operations (with respect to standard CPU) with very high efficiency. When combined with full featured CPU (CISC or RISC) can accelerate the “nominal” speed of a system. CPU ACC Single thread perf. throughput CPU ACC. Physical integration CPU : Control Path & Data Path ALU : Arithmetic and logic units CPU & ACC ACC: Accumulator PC : Program counter Architectural integration Micro-Instructions (Register transfer) Source : NVIDIA, AMD, SGI, Intel, IBM Alter, Xilinux References C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 13 Systems with Accelerators A set (one or more) of very simple execution units that can perform few operations (with respect to standard CPU) with very high efficiency. When combined with full featured CPU (CISC or RISC) can accelerate the “nominal” speed of a system. Source : NVIDIA, AMD, SGI, Intel, IBM Alter, Xilinux References C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 14 Multi-Core Systems with Accelerator Types (partial set) FPGA Xilinx, Alter GPU Nvidia (Kepler), AMD Trinity APU MIC (Intel Xeon-Phi) Intel Xeon-Phi (MIC) Source : NVIDIA, AMD, SGI, Intel, IBM Alter, Xilinux References C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 15 Programming paradigms-Challenges Large scale data Computing – Current trends Computation Data Information Applications al Science Informatics Technology Very-high Efficient, Deterministic, Declarative, Restrictive level Expressiveness based language UMA & NUMA Systems Systems NUMA & UMA Parallel Prog. Languages (OpenMP, PGAS, – High level Intel TBB, Cilk Plus, OpenACC, CUDA) Low-level Low-level APIs (MPI, Pthreads, OpenCL, Verilog) SoftwareThreading Hybrid Computing Hybrid Core Systems Systems Core - Very Machine code, Assembly low-level May - to - Heterogeneous Hardware Multi Coprocessors GPUs FPGAs Source : NVIDIA, AMD, SGI, Intel, IBM Alter, Xilinux & References C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 16 Programming paradigms-Challenges Large scale data Computing – Current trends Typical UMA /NUMA Computing Systems Shared System updates & Performance Memory Cores Improvements Quantify impacts prior to implementation Processes Is the machine working ? Performance be expected Threads Small prototype available Memory What will be the performance of system ? Affinity System available for measurement What will be performance for App Data Parallel Resources Availability Application Scaling – Resources Avilable Compilers …………..………… Performance …………..………… Programming API - Multi-Core Systems with Accelerator Types (partial set) DO TRANSFER from Host to Device Use as Linux OS DO Parallel Perform C omp. On Device DO TRANSFER from Host to Device DO Synchronize Get Maximum of all values Give to Compiler Parallel Prog. Languages (OpenMP, PGAS, Intel TBB, Cilk Plus, OpenACC, CUDA) Programming with High Level APIs Host : Multi-Core Systems OR ARM Multi-core Systems CPU- Multi-Core Sys With GPU - Nvidia (Kepler), MIC (Intel Xeon-Phi) FPGA (Xilinx, Alter) AMD Trinity APU Source : NVIDIA, AMD, SGI, Intel, IBM Alter, Xilinux References C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 18 Prog. On Hybrid Computing Platforms Unimform Memory /Non-Uiniform Memory Cluster : MPI OpenMP 4.0 (Coprocessors / Acclerators CUDA OpenMP Intel TBB Cilk Plus Charm Plus OpenACC General purpose serial and parallel Codes with Highly-parallel computing balanced needs Data Parallel SIMD codes on Acclerators Data Parallel Codes with highly- Task Parallel parallel phases Highly parallel (Data Parallel) codes for Objecti Oriented Source : References & NVIDIA, ntel Xeon-Phi; http://www.intel.com/ C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 19 Intel Xeon Phi - Coprocessor : Architecture Overview C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 20 An Overview of Computing Scenrario Xeon Centric MIC Centric Xeon Scalar Symmetric Parallel MIC Hosted Co-processing Co-processing Hosted General purpose Codes with Highly-parallel serial and parallel balanced needs codes computing Codes with highly- Highly parallel parallel phases codes with scalar phases Source : References & Intel Xeon-Phi; http://www.intel.com/ C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 21 Intel® Xeon Phi™ Architecture Overview Cores: 61 core s, at 1.1 GHz 8 memory controllers in-order, support 4 threads 16 Channel GDDR5 MC 512 bit Vector Processing Unit PCIe GEN2 32 native registers High-speed bi-directional ring interconnect Reliability Features Fully Coherent L2 Cache Parity on L1 Cache, ECC on memory CRC on memory IO, CAP on memory IO Copyright © 2013 Intel Corporation. All rights reserved. C-DAC hyPACK-2013 An Overview of Xeon Phi Coprocessor Prog. Env 22 Intel Xeon Phi Core Architecture Overview 60+ in-order, low power IA cores in a ring Instruction Decode interconnect