HPC Programming Models, Performance Analysis
IBM Systems – Infrastructure Solutions

Ludovic Enault, IBM
Geoffrey Pascal, IBM

Agenda

. System architecture trend overview
. Programming models & languages
. Performance analysis tools

System architecture trend overview


What’s happening today?

. Homogeneous: single-core → multi-core → many-core
  – standard memory model: a unique memory address space
. Hybrid/heterogeneous: CPU + GPU/FPGA accelerators
  – non-standard memory model: 2 architectures with separate memory address spaces

Programming Model Complexity

. Industry shift to multi-cores/many-cores and accelerators
  – Intel Xeon+PHI+FPGA, IBM POWER+GPU+FPGA, ARM+GPU+FPGA
. Increasing
  – # cores
  – heterogeneity with unified memory
  – memory complexity

Accelerated Accelerators

GPU generations and CPU coupling:

  Kepler (2014-2015): CUDA 5.5-7.0, Unified Memory; 1.5 TF, 12 GB @ 288 GB/s; PCIe; POWER8
  Pascal (2016): CUDA 8, full GPU paging / on-demand paging; 16 GB @ ~1 TB/s; SXM2; NVLink 1.0 (>40+40 GB/s); POWER8+
  Volta (2017): CUDA 9, cache coherent; >7.0 TF, 16 GB @ 1.2 TB/s; SXM2; NVLink 2.0 (>75+75 GB/s), coherent; POWER9 with buffered memory

Memory Hierarchy and data locality – single node

. The memory hierarchy tries to exploit data locality
. Processor core (CPU): low-latency design
  – registers, control units, L1/L2/L3 caches, main memory (DDR)
. Accelerator: high-latency, high-bandwidth design
  – registers, control units, L1/L2 caches, local memory (GDDR/HBM)
. Data transfers to the accelerator are very costly



Main architecture trends and characteristics

. More and more cores (CPU and GPU) per node, with simultaneous multi-threading (up to SMT8 on IBM POWER)

  IBM POWER8: 12 cores, ~4 GHz, 128-bit FPU
  Intel Broadwell: 8-18 cores, ~3 GHz, 256-bit FPU
  AMD: 16-24 cores/MCM, 256-bit FPU

. Accelerator integration with a unified memory hierarchy => performance requires data locality

. Vector floating-point units and SIMD (Single Instruction Multiple Data) operations => performance requires application vectorization (of both operations and data)

. Multiple levels of parallelism

Parallel Computing: architecture overview

. Uniform Memory Access (UMA): each processor has uniform access to memory over a shared memory bus
  – shared-memory programming model
. Cache-coherent Non-Uniform Memory Access (ccNUMA): the time for a memory access depends on where the data is located; local access is faster
  – processors grouped around their own memory bus and memory, linked by internal buses
  – shared-memory programming model
. Heterogeneous/hybrid accelerated: each accelerator (GPU, FPGA, MIC) has its own high-performance local memory and address space (changing), attached over PCIe
  – hybrid programming model

HPC cluster

Distributed memory: each node has its own local memory; data exchange between nodes requires message passing (the most popular approach is MPI).

Cluster architecture:

[Figure: several nodes, each with processors on internal buses, local memory, and PCIe-attached accelerators, interconnected by a network]

Programming Models and Languages

[Figure: processes/threads and their address spaces; accelerator address spaces programmed with CUDA, OpenCL, OpenMP, OpenACC]

  Message passing: MPI
  Shared memory: pThreads, OpenMP, OpenACC, Java
  PGAS: UPC, CAF, X10

PGAS (places) model:
. Computation is performed in multiple places.
. A place contains data that can be operated on remotely.
. Data lives in the place it was created, for its lifetime.
. A datum in one place may reference a datum in another place.
. Data structures (e.g. arrays) may be distributed across many places.
. Places may have different computational properties.

Where Does Performance Come From?

. Computer Architecture
  – Instruction issue rate
    . execution pipelining
    . reservation stations
    . branch prediction
    . cache & memory management
  – Parallelism
    . number of operations per cycle per processor: instruction-level parallelism (ILP), vector processing
    . number of threads per core (SMT, multiple instruction pipelines)
    . number of cores per processor
    . number of processors per node
    . number of accelerators per node
    . number of nodes in a system
. Device Technology
  – memory capacity and access time
  – communications bandwidth and latency
  – logic switching speed and device density

No longer just the distributed- and shared-memory paradigms: parallelism exists at every level – node, socket, chip, core, thread, register/SIMD. Need to optimize for all levels!

HPC Programming models & languages

[Timeline, 1950 → 2016: from assembler and a single memory, to high-level languages (Fortran, LISP, COBOL, C, …); then C++, ADA, Perl, Tcl, XML, …; vector units (SIMD); shared memory; distributed memory (MPI and many other models & libraries); shared & distributed memory task and data parallelism (OpenMP, TBB, pThreads, CILK, MapReduce, UPC, CAF, ARMCI/Global Arrays, CUDA, OpenCL, OpenACC, HMPP, StarSs, X10, Chapel, Fortress, Sisal, …); and today multicore, manycore, accelerators and large-scale data parallelism (Python, R for MapReduce/Spark). A standard? C/C++/Fortran + OpenMP]

Programming languages & Programming models

Different ways to program and Accelerate Applications

Applications can be accelerated in three ways:

. Libraries
  – easy to use, most performance
  – Is there an existing library that can do what I want?
. Compiler directives (OpenMP/OpenACC/…)
  – easy to use, portable code
  – Can I easily add directives to help the compiler?
. Programming languages
  – less portable, optimal performance
  – Is it performance critical?

Programming languages

. 2 main types of languages
  – Compiled: C, C++, Fortran, ADA, …
    . compilers: GCC, CLANG/LLVM, IBM XL, Intel, NVIDIA PGI, PathScale, Visual C/C++
  – Interpreted: Python, Java, R, Ruby, Perl, …

. Many programming models
  – Shared memory
    . Pthreads API, OpenMP/OpenACC directives for C/C++/Fortran, TBB (Threading Building Blocks), CILK (lightweight threads embedded into C), Java threads, … (a minimal Pthreads sketch follows this list)
  – Accelerator
    . OpenMP 4.x and OpenACC directives for C/C++/Fortran, CUDA & OpenCL APIs, libspe, ATI, StarPU (INRIA), SequenceL, VHDL for FPGA, …
  – Distributed memory
    . MPI, sockets, PGAS (UPC, CAF, …), …
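As a concrete illustration of the shared-memory APIs listed above, here is a minimal Pthreads sketch in C; it is not taken from the slides, and the thread count, array size and work split are arbitrary choices.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4          /* illustrative values, not from the slides */
#define N 1000000

static float x[N], y[N];
static const float alpha = 2.0f;

/* Each thread updates its own contiguous slice of y (a SAXPY, as used later) */
static void *saxpy_worker(void *arg) {
    long tid = (long)arg;
    long chunk = N / NTHREADS;
    long start = tid * chunk;
    long end = (tid == NTHREADS - 1) ? N : start + chunk;
    for (long i = start; i < end; i++)
        y[i] = alpha * x[i] + y[i];
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, saxpy_worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    printf("y[0] = %f\n", y[0]);
    return 0;
}

Compile with gcc -pthread; OpenMP expresses the same loop with a single directive, as the SAXPY examples later show.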

Strong focus and development effort for OpenMP (IBM, NVIDIA, INTEL)

High Performance Programming overview

. The language should not be the barrier. The critical points are:
  – identifying and extracting parallelism
  – the programming model, as from language to language mainly the syntax changes

THINK PARALLEL: move from scalar computing to data-aware, vector computing.

The critical workflow: Identify Parallelism → Express Parallelism → Express Data Locality → Optimize

Before choosing a programming model & languages

1. What parallelism could you extract?
2. What are the characteristics of your application?
3. Which performance-versus-number-of-cores curve are you on?
4. What is the current performance?
5. What performance do you need?
6. When do you want to reach your target?
7. What is the life span of your application, versus the hardware life span?
8. What are your technical resources and skills?

Programming models for HPC

. The challenge is to efficiently map a problem to the architecture
  – address parallel paradigms for large future systems (vector, threading, data parallelism and transfers, message passing, accelerators, …)
  – address scalability
  – take advantage of all computational resources
  – support performance programming well
  – take advantage of advances in compilers
  – interoperate with existing languages
  – guarantee portability

. For a programmer, the language should not be the barrier. The critical point is the programming model supported; other criteria are portability, simplicity, efficiency and readability.

. Main languages for traditional HPC applications: – C/C++, Fortran, Python, R

. Languages evolution: more parallelism and hybrid computing features (C++17, OpenMP 4.5, OpenACC 3.0, UPC, CAF, …)

Beyond multi-core and parallelism

. The problem is not multi-node, multi-core, many-core, …

But

. The problem is in the application programmer’s head – Do I have parallelism? – What is the right programming model for concurrency or/and heterogeneity, efficiency, readability, manageability, …? – Address clusters, SMPs, multi-cores, accelerators…

. Common trends – more and more processes and threads – Data centric

. How to estimate the development cost and impacts
  – for entrance
  – for exit

Vectorization overview

Each current and future core has vector units.

SIMD – Single Instruction Multiple Data

. Scalar processing (traditional mode): one operation produces one result
. SIMD processing: one operation produces multiple results
  • parallel vector operations: the same operation is applied in parallel to a number of data items packed into a 128- to 512-bit vector (2-8 DP operations per cycle)
  • without vector operations, peak performance must be divided by the vector length
  • there are many different SIMD extensions: SSE, AVX, AVX2, AVX-512, AltiVec, VMX

Vectorization example – SAXPY: single-precision A*X Plus Y

. 3 ways to enable vector operations: compiler, library, and intrinsic APIs

Using the compiler (« portable »):

#define N 1000
void saxpy(float alpha, float *X, float *Y) {
  for (int i = 0; i < N; i++)
    Y[i] = alpha * X[i] + Y[i];
}

$ gcc -O3 -c -std=c99 -fopt-info-optimized saxpy.c
saxpy.c:8:3: note: loop vectorized
saxpy.c:8:3: note: loop versioned for vectorization because of possible aliasing

• Aliasing prevents the compiler from vectorizing
  • pointers to vector data should be declared with the restrict keyword
  • restrict means that we promise that there are no aliases to these pointers
• There is also an issue with access to unaligned data
  • the compiler cannot know whether the pointers are aligned to 16 bytes or not

saxpy1.c – with restrict:

#define N 1000
void saxpy(float alpha, float * __restrict X, float * __restrict Y) {
  for (int i = 0; i < N; i++)
    Y[i] = alpha * X[i] + Y[i];
}

$ gcc -O3 -c -std=c99 -fopt-info-optimized saxpy1.c
saxpy1.c:10:3: note: loop vectorized
saxpy1.c:10:3: note: loop peeled for vectorization to enhance alignment
saxpy1.c:10:3: note: loop with 3 iterations completely unrolled
saxpy1.c:7:6: note: loop with 3 iterations completely unrolled

saxpy2.c – with restrict plus an alignment hint (the slide's exact code is truncated; __builtin_assume_aligned is one way to express it), so no peeling is needed:

#define N 1000
void saxpy(float alpha, float * __restrict X, float * __restrict Y) {
  float * __restrict x = __builtin_assume_aligned(X, 16);
  float * __restrict y = __builtin_assume_aligned(Y, 16);
  for (int i = 0; i < N; i++)
    y[i] = alpha * x[i] + y[i];
}

$ gcc -O3 -c -std=c99 -fopt-info-optimized saxpy2.c
saxpy2.c:10:3: note: loop vectorized

Vectorization example – SAXPY: single-precision A*X Plus Y

. 3 ways to enable vector operations: compiler, library, and intrinsic functions

Using intrinsics (« not portable »). Example: 128-bit SSE intrinsics, prefix _mm_. Process: declare vectors, load/store vectors, apply vector operations.

#include <xmmintrin.h>
#define N 1000
void saxpy(float alpha, float *X, float *Y) {
  __m128 x_vec, y_vec, a_vec, res_vec;            /* Declare vector variables */
  a_vec = _mm_set1_ps(alpha);                      /* Vector of 4 alpha values */
  for (int i = 0; i < N; i += 4) {                 /* loop body reconstructed from the truncated slide */
    x_vec = _mm_loadu_ps(&X[i]);                   /* Load 4 values from X */
    y_vec = _mm_loadu_ps(&Y[i]);                   /* Load 4 values from Y */
    res_vec = _mm_add_ps(_mm_mul_ps(a_vec, x_vec), y_vec);  /* alpha*X + Y */
    _mm_storeu_ps(&Y[i], res_vec);                 /* Store the 4 results back to Y */
  }
}


Programming Models and Languages examples

– Shared Memory with OpenMP

– Distributed Memory with MPI, UPC, CAF, MapReduce/Spark

OpenMP compiler syntax

• OpenMP
  • C/C++: #pragma omp target directive [clause [,] clause] …]
    …often followed by a structured code block
  • Fortran: !$omp target directive [clause [,] clause] …]
    …often paired with a matching end directive surrounding a structured code block:
    !$omp end target directive
  (a small C illustration follows)
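As a hedged C sketch of this syntax (not from the slides; the array size and map clause are arbitrary choices, and the combined target teams distribute parallel for construct is used for brevity):

#include <stdio.h>
#define N 1000

int main(void) {
    float x[N];
    /* The directive is followed by a structured block (here, the for loop).
       map(tofrom: x) copies x to the target device and back. */
    #pragma omp target teams distribute parallel for map(tofrom: x)
    for (int i = 0; i < N; i++)
        x[i] = 2.0f * i;
    printf("x[%d] = %f\n", N - 1, x[N - 1]);
    return 0;
}

Without an offloading-capable compiler the region simply runs on the host.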

OpenMP: work distribution

[Figure: the iterations of a DO/for loop are divided among the threads of the team]
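A minimal C sketch of this work distribution (not from the slides; the iteration count and schedule clause are arbitrary choices):

#include <stdio.h>
#include <omp.h>
#define N 16

int main(void) {
    /* The parallel for construct splits the N iterations among the threads;
       schedule(static) gives each thread a contiguous chunk. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
    return 0;
}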

OpenMP memory domain
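The original slide is a figure; as a reminder of the shared vs. private memory model it depicts, a minimal hedged C sketch (variable names are arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int total = 0;                          /* shared: one copy visible to all threads */
    #pragma omp parallel
    {
        int mine = omp_get_thread_num();    /* declared inside the region: private to each thread */
        #pragma omp atomic
        total += mine;                      /* updates to shared data need synchronization */
    }
    printf("total = %d\n", total);
    return 0;
}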

Distributed Memory Message Passing (MPI) model

[Figure: processes 0, 1, 2, 3, each with its own memory, communicating over the network]

Message Passing
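The original slide is a figure; a minimal hedged C/MPI sketch of the message-passing model (the tag and message content are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        /* rank 0 sends one integer to rank 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 receives the integer from rank 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g. through mpirun); each rank has its own address space, so data moves only through explicit messages.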

MapReduce runtime

Example: count the number of occurrences of each word in a large collection of documents. The MapReduce runtime manages the parallelism transparently.

Map invocations are distributed across multiple machines by automatically partitioning the input data into M splits or shards.

Reduce invocations are distributed by partitioning the intermediate key space into R pieces.

The number of partitions is specified by the user.
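A schematic C sketch of the user-supplied map and reduce functions for the word-count example above; the emit_intermediate/emit callbacks are hypothetical stand-ins for whatever a real MapReduce runtime provides:

#include <stdio.h>
#include <string.h>

/* Hypothetical runtime callbacks: here they just print what would be emitted. */
static void emit_intermediate(const char *key, int value) {
    printf("map emit: (%s, %d)\n", key, value);
}
static void emit(const char *key, int value) {
    printf("reduce emit: (%s, %d)\n", key, value);
}

/* map: for every word in the document, emit (word, 1) */
static void map_wordcount(char *document) {
    for (char *word = strtok(document, " \t\n"); word != NULL;
         word = strtok(NULL, " \t\n"))
        emit_intermediate(word, 1);
}

/* reduce: sum all counts emitted for one word */
static void reduce_wordcount(const char *word, const int *counts, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += counts[i];
    emit(word, sum);
}

int main(void) {
    char doc[] = "to be or not to be";
    int ones[2] = {1, 1};                /* the runtime would group the values per key */
    map_wordcount(doc);
    reduce_wordcount("to", ones, 2);
    return 0;
}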

MapReduce processing scheme



. Simple example: OpenMP and OpenACC for both CPU and GPU
. Express parallelism and manage data locality

OpenMP and OpenACC in Fortran/C/C++ for parallel computing

• Compiler directive advantages:
  • shared and hybrid parallelization
  • work and task parallelization
  • control of data location and movement
  • portable
  • processor and accelerator support
  • limited code changes
  • committed to pre-exascale architectures

OpenMP and OpenACC Directive syntax

• OpenMP
  • C/C++: #pragma omp target directive [clause [,] clause] …]
    …often followed by a structured code block
  • Fortran: !$omp target directive [clause [,] clause] …]
    …often paired with a matching end directive surrounding a structured code block:
    !$omp end target directive

• OpenACC
  • C/C++: #pragma acc directive [clause [,] clause] …]
    …often followed by a structured code block
  • Fortran: !$acc directive [clause [,] clause] …]
    …often paired with a matching end directive surrounding a structured code block:
    !$acc end directive


SAXPY – Single prec A*X Plus Y in OpenMP - CPU

SAXPY in C:

void saxpy(int n, float a, float *x, float *y)
{
#pragma omp parallel for
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

int N = 1<<20;
...
// Perform SAXPY on 1M elements
saxpy(N, 2.0, x, y);
...

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(*), y(*), a
  integer :: n, i
!$omp parallel do
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$omp end parallel do
end subroutine saxpy

...
! Perform SAXPY on N elements
call saxpy(N, 2.0, x, y)
...


SAXPY – Single prec A*X Plus Y in OpenACC - CPU&Accelerator

SAXPY in C:

void saxpy(int n, float a, float *x, float *y)
{
#pragma acc parallel loop
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

int N = 1<<20;
...
// Perform SAXPY on 1M elements
saxpy(N, 2.0, x, y);
...

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(*), y(*), a
  integer :: n, i
!$acc parallel loop
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$acc end parallel
end subroutine saxpy

...
! Perform SAXPY on N elements
call saxpy(N, 2.0, x, y)
...


SAXPY – Single prec A*X Plus Y in OpenMP – Accelerator (GPU)

SAXPY in C:

void saxpy(int n, float a, float *x, float *y)
{
#pragma omp target teams \
        distribute parallel for
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

int N = 1<<20;
...
// Perform SAXPY on 1M elements
saxpy(N, 2.0, x, y);
...

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(*), y(*), a
  integer :: n, i
!$omp target teams &
!$omp& distribute parallel do
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$omp end target teams &
!$omp& distribute parallel do
end subroutine saxpy

...
! Perform SAXPY on N elements
call saxpy(N, 2.0, x, y)
...

A single example of how to express parallelism and data locality using compiler-directive languages on a GPU accelerator

Identify Parallelism → Express Parallelism → Express Data Locality → Optimize

Data must be transferred between the CPU and GPU memories.

Example: Jacobi Iteration

. Each point is updated from its four neighbors (5-point stencil):

  A_new(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )

Jacobi Iteration: C Code

while ( err > tol && iter < iter_max ) {   // Iterate until converged
  err = 0.0;

  // Iterate across matrix elements
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      // Calculate new value from neighbors
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      // Compute max error for convergence
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  // Swap input/output arrays
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

Jacobi Iteration: C Code

while ( err > tol && iter < iter_max ) {   // Data dependency between iterations
  err = 0.0;

  // Independent loop iterations
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  // Independent loop iterations
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

Workflow step: Identify Parallelism → Express Parallelism → Express Data Locality → Optimize

Jacobi Iteration: OpenMP C Code for CPU

while ( err > tol && iter < iter_max ) {
  err = 0.0;

  // Parallelize loop across CPU threads
  #pragma omp parallel for shared(m, n, Anew, A) reduction(max:err)
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  // Parallelize loop across CPU threads
  #pragma omp parallel for shared(m, n, Anew, A)
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

Workflow step: Identify Parallelism → Express Parallelism → Express Data Locality → Optimize

Jacobi Iteration: OpenACC C Code – CPU & Accelerator

while ( err > tol && iter < iter_max ) {
  err = 0.0;

  // Parallelize loop on the accelerator
  #pragma acc parallel loop reduction(max:err)
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  // Parallelize loop on the accelerator
  #pragma acc parallel loop
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

Workflow step: Identify Parallelism → Express Parallelism → Express Data Locality → Optimize

Building the code

$ pgcc -acc -ta=nvidia:5.5,kepler -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     56, Accelerator kernel generated
         57, #pragma acc loop gang /* blockIdx.x */
         59, #pragma acc loop vector(256) /* threadIdx.x */
     56, Generating present_or_copyout(Anew[1:4094][1:4094])
         Generating present_or_copyin(A[0:][0:])
         Generating NVIDIA code
         Generating compute capability 3.0 binary
     59, Loop is parallelizable
     63, Max reduction generated for error
     68, Accelerator kernel generated
         69, #pragma acc loop gang /* blockIdx.x */
         71, #pragma acc loop vector(256) /* threadIdx.x */
     68, Generating present_or_copyin(Anew[1:4094][1:4094])
         Generating present_or_copyout(A[1:4094][1:4094])
         Generating NVIDIA code
         Generating compute capability 3.0 binary
     71, Loop is parallelizable

Why is OpenACC so much slower?

Profiling an OpenACC Application

$ nvprof ./laplace2d_acc
Jacobi relaxation Calculation: 4096 x 4096 mesh
==10619== NVPROF is profiling process 10619, command: ./laplace2d_acc
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
total: 134.259326 s
==10619== Profiling application: ./laplace2d_acc
==10619== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 49.59%  44.0095s     17000  2.5888ms     864ns  2.9822ms  [CUDA memcpy HtoD]
 45.06%  39.9921s     17000  2.3525ms  2.4960us  2.7687ms  [CUDA memcpy DtoH]
  2.95%  2.61622s      1000  2.6162ms  2.6044ms  2.6319ms  main_56_gpu
  2.39%  2.11884s      1000  2.1188ms  2.1023ms  2.1374ms  main_68_gpu
  0.01%  12.431ms      1000  12.430us  12.192us  12.736us  main_63_gpu_red

Excessive Data Transfers

while ( err > tol && iter < iter_max ) {
  err = 0.0;

  // A, Anew resident on host --copy--> A, Anew resident on accelerator
  #pragma acc parallel loop reduction(max:err)
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
  // A, Anew resident on accelerator --copy--> A, Anew resident on host

  ...
}

These copies happen on every iteration of the outer while loop!

=> Need to use data directives to control data location and transfers

Jacobi Iteration: OpenACC C Code

// Copy A to/from the accelerator only when needed.
// Create Anew as a device temporary.
#pragma acc data copy(A) create(Anew)
while ( err > tol && iter < iter_max ) {
  err = 0.0;

  #pragma acc parallel loop reduction(max:err)
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop
  for ( int j = 1; j < n-1; j++ ) {
    for ( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

Workflow step: Identify Parallelism → Express Parallelism → Express Data Locality → Optimize

. Programming Languages and Models vs Compilers – Accelerator or CPU performance?

. OpenACC:
  – PGI, (GCC 6.x/7.x), Omni compiler, …

. OpenMP 4.x:
  – CLANG, IBM XL compiler, Intel compiler, …

Performance Analysis Tools


. 2 types of performance monitoring
  – System monitoring (per node): % CPU utilization, memory utilization, disk, network, energy, hardware counters
  – Application profiling: CPU profiling per function/line, communication profiling, memory profiling, I/O profiling, thread profiling, hardware counters

System monitoring

• Some performance tools: top, htop, nmon, netstat, lpcpu, iostat, sar, dstat, …
• Frameworks: Ganglia, Collectd/Graphite/Grafana, …
. System data
  – CPU
  – memory
  – disks
  – networks/ports
  – file systems
  – processes/threads
  – locality/affinity
  – …
. Report + automated/intelligent assist

top / htop

. System monitoring
  – core usage
  – memory usage
  – process information (running status, owner)
. Monitors a single node
  – limited by …

Nmon (http://nmon.sourceforge.net/pmwiki.php)

. Displays CPU, GPU, energy, memory, network, disks (mini graphs or numbers), file systems, NFS, top processes, resources, …
. Command: nmon

Application performance analysis tools

. Sampling vs instrumented instrumentation
  – sampling: limited overhead
  – instrumentation: requires filters to reduce overhead
. Performance data
  – MPI stats
  – OpenMP stats
  – hardware counters & derived metrics
  – I/O stats
  – CPU profile
  – data transfer stats
  – power consumption
  – …
. Main debuggers
  – gdb, TotalView, Allinea DDT
. Some performance tools
  – Linux: GNU CPU profiling (gprof), perf, Valgrind, …
  – Frameworks:
    . Intel suite: STAT, ITAC, MPS, VTUNE, ADVISOR
    . Scalasca, TAU/ParaProf/PerfExplorer, Periscope
    . Paraver
    . Allinea MAP / Performance Reports
    . NVIDIA nvvp, OpenCL Visual Profiler
    . Vampir
    . Java profilers (e.g. JProfiler)
    . …
. Automated/intelligent assist

Code profiling

. Purpose
  – Identify the most time-consuming routines of a binary
    • in order to determine where the optimization effort should take place
. Standard features
  – Construct a display of the functions within an application
  – Help users identify the most CPU-intensive functions
  – Charge execution time to source lines
. Methods & tools
  – GNU profiler, visual profilers, the addr2line Linux command, …
  – newer profilers are mainly based on the Binary File Descriptor library and the opcodes library to assemble and disassemble machine instructions
  – need to compile with -g
  – hardware counters
. Notes
  – Profiling can be used to profile both serial and parallel applications
  – Based on sampling (support from both compiler and kernel)

GNU Profiler (gprof) | How-to | Collection

. Compile the program with the options -g -pg
  – creates the symbols required for debugging/profiling
. Execute the program
  – the standard way
  – execution generates profiling files in the execution directory: gmon.out.
    • binary files, not readable
    • necessary to control the number of files to reduce overhead
. Two options for interpreting the output files
  – GNU profiler (command-line utility): gprof
    • gprof gmon.out. > gprof.out.
. Advantages of profilers based on the Binary File Descriptor library versus gprof
  – recompilation not necessary (linking only)
  – performance overhead significantly lower

GNU profile overview

. Step 1: compile the code with the '-pg' option
  – $ gcc -Wall -pg test_gprof.c test_gprof_new.c -o test_gprof
  – $ ls
    test_gprof  test_gprof.c  test_gprof_new.c
. Step 2: execute the code
  – $ ./test_gprof
  – $ ls
    gmon.out  test_gprof  test_gprof.c  test_gprof_new.c
. Step 3: run the gprof tool
  – $ gprof test_gprof gmon.out > analysis.txt
  – $ cat analysis.txt
(a hypothetical test_gprof.c is sketched below)
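The slides do not show the contents of test_gprof.c; a hypothetical single-file stand-in (function names and loop counts chosen only so that gprof has distinct entries to attribute time to) could look like this:

#include <stdio.h>

/* Two deliberately time-consuming functions so gprof reports separate entries. */
static double func1(void) {
    double s = 0.0;
    for (long i = 0; i < 200000000L; i++)
        s += (double)i * 1e-9;
    return s;
}

static double func2(void) {
    double s = 0.0;
    for (long i = 0; i < 100000000L; i++)
        s += (double)i * 1e-9;
    return s;
}

int main(void) {
    printf("%f %f\n", func1(), func2());
    return 0;
}

Compiled with -pg as in step 1, running it produces gmon.out, and gprof then charges roughly twice as much time to func1 as to func2.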

perf Linux command

. perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface.

perf Linux serial execution

perf Linux MPI execution

. mpirun [mpirun_options] myperf.sh executable [args]

$ cat myperf.sh
#!/bin/bash
driver=""
# (**)
if [ $PMI_RANK -eq 0 ] ; then
  driver="perf record -e cycles -e instructions -o perf.data.$PMI_RANK"
fi
$driver "$@"

(**) Check your MPI library and batch scheduler to get the MPI rank environment variable.

Valgrind

. Memory checker and profiler
. Not interactive
. Adds overhead during execution
. Requires symbols in your code (compile with the '-g' flag with the Intel compiler and GCC)
. Gives information about:
  – memory overflows
  – use of undefined values
  – memory not freed at the end of the execution
  – double free corruption (releasing already-freed memory)

Command                          Purpose
valgrind                         Perform regular memory checking
valgrind -v                      Verbose mode
valgrind --leak-check=full       Perform memory leak checking
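As a hedged illustration of what Valgrind reports, a tiny C program with a deliberate leak and a read of an uninitialized value (the file name and allocation sizes are arbitrary):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *leaked = malloc(100 * sizeof(int));  /* intentionally never freed: reported by --leak-check=full */
    int *buf = malloc(10 * sizeof(int));
    if (buf[0] > 0)                           /* buf[0] is read before being initialized */
        printf("positive\n");
    free(buf);
    (void)leaked;                             /* silence unused-variable warnings */
    return 0;
}

Compiling with -g and running valgrind --leak-check=full on the binary reports both the uninitialized read and the unfreed allocation.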

INTEL MPI Profiling: STAT

INTEL MPI Profiling: ITAC (~Vampir, TAU, …)

INTEL MPI Profiling: MPS


Scalasca (http://www.scalasca.org/) – open source

. Scalasca is a software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes.

. Performance analysis steps

Scalasca analysis report exploration 1/2

Scalasca analysis report exploration 2/2

TAU

. TAU = Tuning and Analysis Utilities
  – Program and performance analysis tool framework being developed for the DOE Office of Science, ASC initiatives at LLNL, the ZeptoOS project at ANL, and Los Alamos National Laboratory
  – Provides a suite of static and dynamic tools that offer graphical user interaction and interoperation to form an integrated analysis environment for parallel Fortran, C++, C, Java, and Python applications
  – Link: http://www.cs.uoregon.edu/research/tau/home.php
