Introduction to GPU/Parallel Computing

Ioannis E. Venetis, University of Patras

1 Introduction to GPU/Parallel Computing www.prace-ri.eu Introduction to High Performance Systems

2 Introduction to GPU/Parallel Computing www.prace-ri.eu Wait, what?

Aren’t we here to talk about GPUs? And how to program them with CUDA? Yes, but first we need to understand their place and their purpose in modern High Performance Systems. This will make it clear when it is beneficial to use them.

3 Introduction to GPU/Parallel Computing www.prace-ri.eu Top 500 (June 2017)

Rank 1: Sunway TaihuLight, National Supercomputing Center in Wuxi, China (NRCPC). Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway interconnect. 10.649.600 CPU cores, no accelerator cores, Rmax 93.014,6 TFlop/s, Rpeak 125.435,9 TFlop/s, power 15.371 kW.
Rank 2: Tianhe-2 (MilkyWay-2), National Super Computer Center in Guangzhou, China (NUDT). TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P. 3.120.000 CPU cores, 2.736.000 accelerator cores, Rmax 33.862,7 TFlop/s, Rpeak 54.902,4 TFlop/s, power 17.808 kW.
Rank 3: Piz Daint, Swiss National Supercomputing Centre (CSCS) (Cray Inc.). Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100. 361.760 CPU cores, 297.920 accelerator cores, Rmax 19.590,0 TFlop/s, Rpeak 25.326,3 TFlop/s, power 2.272 kW.
Rank 4: Titan, DOE/SC/Oak Ridge National Laboratory, United States (Cray Inc.). Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x. 560.640 CPU cores, 261.632 accelerator cores, Rmax 17.590,0 TFlop/s, Rpeak 27.112,5 TFlop/s, power 8.209 kW.
Rank 5: Sequoia, DOE/NNSA/LLNL, United States (IBM). BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect. 1.572.864 CPU cores, no accelerator cores, Rmax 17.173,2 TFlop/s, Rpeak 20.132,7 TFlop/s, power 7.890 kW.

4 Introduction to GPU/Parallel Computing www.prace-ri.eu How do we build an HPC system?

Limitations in technology: it is impossible to fit all the computational resources we require into a single chip.

We have to build our system hierarchically

5 Introduction to GPU/Parallel Computing www.prace-ri.eu

All modern processors are “multi-core”: multiple, independent cores are placed on the same chip. They might also support Simultaneous Multi-Threading (SMT): every core can execute more than one instruction stream (threads), which however share most of the functional units of the core. This is the 1st level of parallelism. Typically 4, 8 or 16 cores.
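As a small illustration (a minimal sketch added here, not part of the original slides), an OpenMP program can query how many processors the runtime sees on such a multi-core chip and how many threads it would use by default; omp_get_num_procs() and omp_get_max_threads() are standard OpenMP runtime calls:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Number of processors (cores / hardware threads) available to the program */
    printf("Available processors: %d\n", omp_get_num_procs());

    /* Upper bound on the number of threads used in a parallel region */
    printf("Maximum OpenMP threads: %d\n", omp_get_max_threads());

    return 0;
}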

6 Introduction to GPU/Parallel Computing www.prace-ri.eu Compute card

One or more processors are placed on a compute card. Typically, a single compute card operates as a shared-memory system. It usually contains 1, 2 or 4 processors.

7 Introduction to GPU/Parallel Computing www.prace-ri.eu Node

Multiple compute cards are placed in a node. There is no shared memory among compute cards. The interconnection network among compute cards can be implemented in many different ways. Usually there are one or more additional compute cards dedicated to managing communication with the rest of the nodes.

8 Introduction to GPU/Parallel Computing www.prace-ri.eu Rack

Multiple nodes are placed in a rack. There is no shared memory among the nodes of a rack. The interconnection network among nodes can be implemented in many different ways, not necessarily in the same way that compute cards are connected within a single node.

9 Introduction to GPU/Parallel Computing www.prace-ri.eu The whole system

Multiple racks are connected. Typically, there are dedicated nodes that handle I/O.

10 Introduction to GPU/Parallel Computing www.prace-ri.eu Hierarchical parallelism

(Figure: the levels of hierarchical parallelism, illustrated on an IBM BlueGene/P system.)

11 Introduction to GPU/Parallel Computing www.prace-ri.eu Examples of modern High Performance Systems

12 Introduction to GPU/Parallel Computing www.prace-ri.eu Sunway TaihuLight (No 1, Top 500 list, June 2017)

Computing node: the basic element of the architecture. 256 computing nodes form a super node. Super nodes are connected through the central switch network.

Sources of images:
• Fu, H., Liao, J., Yang, J. et al., “The Sunway TaihuLight: system and applications”, Sci. China Inf. Sci. (2016) 59: 072001, doi:10.1007/s11432-016-5588-7
• Dongarra, J., “Report on the Sunway TaihuLight System”, Tech Report UT-EECS-16-742, June 2016.

13 Introduction to GPU/Parallel Computing www.prace-ri.eu Processor

SW26010: one of the few systems that rely on a custom-made processor, designed by the Shanghai High Performance IC Design Center. A characteristic example of a heterogeneous many-core processor, composed of 2 different types of cores.

14 Introduction to GPU/Parallel Computing www.prace-ri.eu Processor

Contains 4 Core Groups (CGs), connected through a Network on Chip (NoC). Each CG is composed of: 1 Management Processing Element (MPE) and 64 Computing Processing Elements (CPEs), placed on an 8x8 grid.

15 Introduction to GPU/Parallel Computing www.prace-ri.eu Processor

Each CG has a distinct address space, connected to the MPE and the CPEs through a Memory Controller (MC). Each processor connects to the rest of the system through the System Interface (SI).

16 Introduction to GPU/Parallel Computing www.prace-ri.eu The two types of cores

Management Processing Element (MPE)
Complete 64-bit RISC core
Executes instructions in user and system modes, handles interrupts, memory management, superscalar, out-of-order execution, …
Performs all management and communication tasks
Computing Processing Element (CPE)
Reduced-capability 64-bit RISC core
Executes instructions only in user mode, does not handle interrupts, …
Objectives of the design: maximum overall performance, reduced design complexity
The CPEs are placed on an 8x8 grid, which allows for fast exchange of data directly between registers

17 Introduction to GPU/Parallel Computing www.prace-ri.eu Compute card

2 processors

18 Introduction to GPU/Parallel Computing www.prace-ri.eu Node

4 compute cards, 2 on each side

19 Introduction to GPU/Parallel Computing www.prace-ri.eu Supernode

32 nodes (256 processors)

20 Introduction to GPU/Parallel Computing www.prace-ri.eu Cabinet

4 supernodes (1024 processors)

21 Introduction to GPU/Parallel Computing www.prace-ri.eu Sunway TaihuLight

40 cabinets

22 Introduction to GPU/Parallel Computing www.prace-ri.eu Overview

Cores: 10.649.600
Peak performance: 125,436 PFlops
Linpack performance: 93,015 PFlops
CPU frequency: 1,45 GHz
Peak performance of a CPU: 3,06 TFlops
Total memory: 1310,72 TB
Total memory bandwidth: 5591,5 TB/s
Network link bandwidth: 16 GB/s
Network bisection bandwidth: 70 TB/s
Network diameter: 7
Total storage: 20 PB
Total I/O bandwidth: 288 GB/s
Power consumption when running the Linpack test: 15,371 MW
Performance/power ratio: 6,05 GFlops/W
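As a quick check of the last line (added here for illustration; the slide uses a comma as the decimal separator), the performance/power ratio follows directly from the Linpack performance and the power consumption:

\[
\frac{93.015 \times 10^{6}\ \mathrm{GFlops}}{15.371 \times 10^{6}\ \mathrm{W}} \approx 6.05\ \mathrm{GFlops/W}
\]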

23 Introduction to GPU/Parallel Computing www.prace-ri.eu Tianhe-2 (No 2, Top 500 list, June 2017)

In contrast to Sunway TaihuLight, it has typical/commercial processors: Intel Xeon E5-2692, 12 cores, 2.2 GHz. To achieve high performance it uses coprocessors: Intel Xeon Phi 31S1P, 57 cores, 4-way SMT, 1.1 GHz, PCI-E 2.0 interconnect with the host system.

24 Introduction to GPU/Parallel Computing www.prace-ri.eu Compute card

Contains 2 processors and 3 Xeon Phi

25 Introduction to GPU/Parallel Computing www.prace-ri.eu Node

Contains 2 compute cards Special interconnection

26 Introduction to GPU/Parallel Computing www.prace-ri.eu Frame

16 nodes

27 Introduction to GPU/Parallel Computing www.prace-ri.eu Rack

4 frames

28 Introduction to GPU/Parallel Computing www.prace-ri.eu Tianhe-2

125 racks

29 Introduction to GPU/Parallel Computing www.prace-ri.eu Overview

Cores: 3.120.000
Peak performance: 54,902 PFlops
Linpack performance: 33,863 PFlops
CPU frequency: 2,2 GHz / 1,1 GHz
Total memory: 1.404 TB
Total storage: 12,4 PB
Total I/O bandwidth: 100 GB/s
Power consumption when running Linpack: 17,808 MW
Performance/power ratio: 1,9 GFlops/W

30 Introduction to GPU/Parallel Computing www.prace-ri.eu Titan (No 4, Top 500 list, June 2017)

Also consists of typical/commercial processors: AMD Opteron 6274, 16 cores, 2.2 GHz. To achieve high performance it uses coprocessors: NVIDIA K20x, 2688 cores, 732 MHz, PCI-E 2.0 interconnect with the host system.

31 Introduction to GPU/Parallel Computing www.prace-ri.eu Compute card / Node

Contains 1 processor + 1 GPU. 2 nodes share the router of the interconnection network.

(Figure: two nodes attached to one network router; X, Y, Z denote the directions of the interconnection network.)

32 Introduction to GPU/Parallel Computing www.prace-ri.eu Blade / Cabinet

Each blade contains 4 nodes Each cabinet contains 24 blades

33 Introduction to GPU/Parallel Computing www.prace-ri.eu Titan

200 cabinets

34 Introduction to GPU/Parallel Computing www.prace-ri.eu Overview

Cores: 560.640
Peak performance: 27,113 PFlops
Linpack performance: 17,590 PFlops
CPU frequency: 2,2 GHz / 2,2 GHz
Total memory: 710 TB
Total storage: 40 PB
Total I/O bandwidth: 1,4 TB/s
Power consumption when running Linpack: 8,209 MW
Performance/power ratio: 2,1 GFlops/W

35 Introduction to GPU/Parallel Computing www.prace-ri.eu Comparison

Metric | Sunway TaihuLight | Tianhe-2 | Titan
Cores | 10.649.600 | 3.120.000 | 560.640
Peak performance | 125,436 PFlops | 54,902 PFlops | 27,113 PFlops
Linpack performance | 93,015 PFlops | 33,863 PFlops | 17,590 PFlops
CPU frequency | 1,45 GHz | 2,2 GHz / 1,1 GHz | 2,2 GHz / 2,2 GHz
Total memory | 1310,72 TB | 1.404 TB | 710 TB
Total storage | 20 PB | 12,4 PB | 40 PB
Total I/O bandwidth | 288 GB/s | 100 GB/s | 1,4 TB/s
Power consumption for Linpack | 15,371 MW | 17,808 MW | 8,209 MW
Performance/power ratio | 6,05 GFlops/W | 1,9 GFlops/W | 2,1 GFlops/W

36 Introduction to GPU/Parallel Computing www.prace-ri.eu Power consumption

Average daily power consumption per household: 11 kWh (http://www.cres.gr/pepesec/apotelesmata.html, a small study, but it gives a picture). Tianhe-2: 17.808 kW * 24 hours = 427.392 kWh. It consumes in one day as much as 38.854 households! If on average 3 people live in a household: 38.854 * 3 = 116.562 people. Volos, the 6th largest city in Greece, has about 120.000 citizens (2011 census).

37 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming High Performance Systems

38 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming High Performance Systems

As we have outlined previously, HPC systems are composed of different parts with different architectural features: shared memory (processor, compute card), distributed memory (node, rack, complete system), coprocessors.

How to exploit all computational resources? A different programming model has to be used for each level in the hierarchy

39 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming level

(Figure: IBM BlueGene/P hierarchy, with the distributed-memory level highlighted.)

Distributed memory 40 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming for distributed memory

Any programming model that targets distributed-memory systems: MPI, …

41 Introduction to GPU/Parallel Computing www.prace-ri.eu MPI (Message Passing Interface)

A standard, not a specific implementation. A library to pass messages among processes. Layered design: at a high level it provides an API to the programmer, at a low level it takes care of the communication through the interconnection network. Portable across different distributed-memory systems. It supports C, C++, Fortran 77 and Fortran 90.

42 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with MPI (1/3) (Just to give an intuition of its use…)

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int my_rank, p, i, num, size;
    int A[100], B[100], C[100], loc_A[100], loc_B[100], loc_C[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Only process 0 reads the input vectors */
    if (my_rank == 0) {
        printf("Enter size of vectors: ");
        scanf("%d", &size);
        printf("Enter the values of the %d elements of vector A: ", size);
        for (i = 0; i < size; i++) {
            scanf("%d", &A[i]);
        }
        printf("Enter the values of the %d elements of vector B: ", size);
        for (i = 0; i < size; i++) {
            scanf("%d", &B[i]);
        }
    }

43 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with MPI (2/3)

    /* Process 0 broadcasts the vector size to all processes */
    MPI_Bcast(&size, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Number of elements handled by each process (assumes size is divisible by p) */
    num = size / p;

    /* Distribute equal chunks of A and B to all processes */
    MPI_Scatter(A, num, MPI_INT, loc_A, num, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(B, num, MPI_INT, loc_B, num, MPI_INT, 0, MPI_COMM_WORLD);

    /* Each process adds its local chunks */
    for (i = 0; i < num; i++) {
        loc_C[i] = loc_A[i] + loc_B[i];
    }

    printf("\nLocal results for process %d:\n", my_rank);
    for (i = 0; i < num; i++) {
        printf("%d ", loc_C[i]);
    }
    printf("\n\n");

44 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with MPI (3/3)

    /* Gather the partial results back to process 0 */
    MPI_Gather(loc_C, num, MPI_INT, C, num, MPI_INT, 0, MPI_COMM_WORLD);

    if (my_rank == 0) {
        printf("\nFinal result:\n");
        for (i = 0; i < size; i++) {
            printf("%d ", C[i]);
        }
        printf("\n\n");
    }

    MPI_Finalize();

    return(0);
}

45 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming level

(Figure: IBM BlueGene/P hierarchy, with the shared-memory level highlighted.)

Shared memory 46 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming for shared memory

Any programming model that targets shared-memory systems:
OpenMP (most used)
OpenCL (remember this!)
Threading libraries (POSIX Threads, …)
Cilk Plus
…

47 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenMP

Programming model for parallel computation, based on the idea of directives. The programmer marks the points in the program that have to be parallelized and specifies how they should be parallelized. The compiler creates and integrates the parallelism into the executable file.

48 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenMP

A standard, not a specific implementation. It provides an API to program shared-memory systems through: directives to the compiler, function calls to a run-time library, environment variables. It supports C, C++ and Fortran. Portable across different shared-memory systems. It supports different patterns of parallelism.

49 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenMP (1/2) (Just to give an intuition of its use…)

#include <stdio.h>
#include <stdlib.h>

int *A, *B, *C;
int N;

int main(int argc, char *argv[])
{
    int i;

    if (argc != 2) {
        printf("Provide the problem size.\n");
        exit(0);
    }

    N = atoi(argv[1]);

    A = (int *)malloc(N * sizeof(int));
    B = (int *)malloc(N * sizeof(int));
    C = (int *)malloc(N * sizeof(int));
    if ((A == NULL) || (B == NULL) || (C == NULL)) {
        printf("Could not allocate memory.\n");
        exit(0);
    }

50 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenMP (2/2)

    /*
     * Here A and B should be initialized
     */

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            C[i] = A[i] + B[i];
        }
    }

    printf("C = ");
    for (i = 0; i < N; i++) {
        printf("%d ", C[i]);
    }

    return(0);
}   /* main() ends here */

51 Introduction to GPU/Parallel Computing www.prace-ri.eu Advantages

High level of abstraction: no threads are (directly) involved and no details of the hardware.
Single source code: the same source code can be used for serial or parallel compilation/execution.
Performance: acceptable performance compared to highly optimized code with threads.
Scalability: relatively easy to achieve for different systems.
Incremental parallelization: porting and optimization of parts of the code, depending on the available resources and profiling; no rewriting/reorganization of code is required (as with threads); shorter time for parallelization, with fewer errors.

52 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenMP 4.5: Parallelization levels

Cluster: group of nodes that communicate through a fast interconnection network
Coprocessor/Accelerator: processing units that attach to the main processing unit through a special interconnect (the level targeted by OpenMP for coprocessors)
Node: group of processors that communicate through shared memory
Socket: group of cores that communicate through shared memory and/or shared cache memory
Core: group of functional units that communicate through registers
Hyper-Threads: group of hardware-level threads that share the functional units of a core
Superscalar: group of instructions that share functional units
Pipeline: sequence of instructions that share functional units
Vector: a single instruction that uses multiple functional units

53 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming level

Coprocessor

(Figure: IBM BlueGene/P hierarchy, with the coprocessor level highlighted.)

54 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming for a coprocessor

Appropriate programming model for the coprocessor used:
NVIDIA GPU: CUDA, OpenACC, OpenMP, OpenCL
Intel Xeon Phi: OpenMP, Cilk Plus, OpenCL

55 Introduction to GPU/Parallel Computing www.prace-ri.eu CUDA

Programming model for NVIDIA GPUs. An extension to the C/C++ programming languages: new keywords, new predefined structs, functions that are called from the main program (kernels), predefined macros. General-purpose programming.

56 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with CUDA (1/3) (Foretaste…)

__global__ void vecAdd(float *A_d, float *B_d, float *C_d, int n)
{
    /* Global index of the thread: one thread per vector element */
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < n) {
        C_d[i] = A_d[i] + B_d[i];
    }
}

57 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with CUDA (2/3)

#include <stdlib.h>    /* malloc(), rand() */

int main(int argc, char *argv[])
{
    const unsigned int n = 2048;
    float *A_h, *B_h, *C_h;

    A_h = (float *)malloc(n * sizeof(float));
    for (unsigned int i = 0; i < n; i++) {
        A_h[i] = (rand() % 100) / 100.0;
    }

    B_h = (float *)malloc(n * sizeof(float));
    for (unsigned int i = 0; i < n; i++) {
        B_h[i] = (rand() % 100) / 100.0;
    }

    C_h = (float *)malloc(n * sizeof(float));

    float *A_d, *B_d, *C_d;

    /* Allocate the vectors in the global memory of the GPU */
    cudaMalloc((void**)&A_d, n * sizeof(float));
    cudaMalloc((void**)&B_d, n * sizeof(float));
    cudaMalloc((void**)&C_d, n * sizeof(float));

58 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with CUDA (3/3)

    /* Copy the input vectors from host to device memory */
    cudaMemcpy(A_d, A_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, n * sizeof(float), cudaMemcpyHostToDevice);

    /* One thread per element; enough blocks to cover all n elements */
    const unsigned int THREADS_PER_BLOCK = 512;
    const unsigned int numBlocks = (n - 1) / THREADS_PER_BLOCK + 1;
    dim3 gridDim(numBlocks, 1, 1), blockDim(THREADS_PER_BLOCK, 1, 1);

    vecAdd<<< gridDim, blockDim >>>(A_d, B_d, C_d, n);

    /* Copy the result back from device to host memory */
    cudaMemcpy(C_h, C_d, sizeof(float) * n, cudaMemcpyDeviceToHost);

    free(A_h); free(B_h); free(C_h);

    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);

    return(0);
}

59 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenACC

Programming model based on directives, comparable to OpenMP. Targets coprocessors. It started as a branch of OpenMP with the aim of being merged back; this never happened…

60 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenACC (1/2) (Just to give an intuition of its use…)

#include <stdio.h>
#include <stdlib.h>

int *A, *B, *C;
int N;

int main(int argc, char *argv[])
{
    int i;

    if (argc != 2) {
        printf("Provide the problem size.\n");
        exit(0);
    }

    N = atoi(argv[1]);

    A = (int *)malloc(N * sizeof(int));
    B = (int *)malloc(N * sizeof(int));
    C = (int *)malloc(N * sizeof(int));
    if ((A == NULL) || (B == NULL) || (C == NULL)) {
        printf("Could not allocate memory.\n");
        exit(0);
    }

61 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenACC (2/2)

    /*
     * Here A and B should be initialized
     */

    /* Offload the loop; copy A and B to the device and C back to the host */
    #pragma acc kernels copyin(A[0:N], B[0:N]), copyout(C[0:N])
    for (i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    printf("C = ");
    for (i = 0; i < N; i++) {
        printf("%d ", C[i]);
    }

    return(0);
}   /* main() ends here */

62 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenMP

Starting with version 4.0, it provides directives to support offloading of computations and data to coprocessors

63 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenMP (1/2) (Just to give an intuition of its use…)

#include <stdio.h>
#include <stdlib.h>

int *A, *B, *C;
int N;

int main(int argc, char *argv[])
{
    int i;

    if (argc != 2) {
        printf("Provide the problem size.\n");
        exit(0);
    }

    N = atoi(argv[1]);

    A = (int *)malloc(N * sizeof(int));
    B = (int *)malloc(N * sizeof(int));
    C = (int *)malloc(N * sizeof(int));
    if ((A == NULL) || (B == NULL) || (C == NULL)) {
        printf("Could not allocate memory.\n");
        exit(0);
    }

64 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenMP (2/2)

    /*
     * Here A and B should be initialized
     */

    /* Offload to the device: map N, A and B to it and map C back from it */
    #pragma omp target map(to: N, A[0:N], B[0:N]) map(from: C[0:N])
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    printf("C = ");
    for (i = 0; i < N; i++) {
        printf("%d ", C[i]);
    }

    return(0);
}   /* main() ends here */

65 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenCL

Programming model for executing code on heterogeneous computational resources: CPU, GPU, FPGA, DSP, … Based on function calls defined by the standard. Many similarities to CUDA, but programming is a bit more complicated.
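To give a flavour of this (a minimal sketch added here for illustration, not part of the original slides; all error checking is omitted and the variable names are made up), vector addition in OpenCL needs explicit setup of platform, device, context, queue, buffers and kernel:

#include <stdio.h>
#include <CL/cl.h>

/* The kernel is plain text, compiled at run time for the selected device */
const char *src =
    "__kernel void vecAdd(__global const float *A, __global const float *B, \n"
    "                     __global float *C, int n)                          \n"
    "{                                                                       \n"
    "    int i = get_global_id(0);                                           \n"
    "    if (i < n) { C[i] = A[i] + B[i]; }                                  \n"
    "}                                                                       \n";

int main(void)
{
    const int n = 2048;
    float A[2048], B[2048], C[2048];
    for (int i = 0; i < n; i++) { A[i] = i; B[i] = 2 * i; }

    /* 1. Select a platform and a device (here: the first GPU found) */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* 2. Create a context and a command queue for that device */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 3. Create device buffers; A and B are copied from the host arrays */
    cl_mem A_d = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), A, NULL);
    cl_mem B_d = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), B, NULL);
    cl_mem C_d = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                n * sizeof(float), NULL, NULL);

    /* 4. Build the kernel and set its arguments */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vecAdd", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &A_d);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &B_d);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &C_d);
    clSetKernelArg(kernel, 3, sizeof(int), &n);

    /* 5. Launch n work-items and read the result back */
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, C_d, CL_TRUE, 0, n * sizeof(float), C, 0, NULL, NULL);

    printf("C[10] = %f\n", C[10]);
    return 0;
}

Compared with the CUDA version shown earlier, the kernel itself is almost identical; the extra complexity is in the host-side setup.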

66 Introduction to GPU/Parallel Computing www.prace-ri.eu Software development issues on HPC environments

67 Introduction to GPU/Parallel Computing www.prace-ri.eu HPC Programmers

Most of them are actually not computer scientists or computer engineers. Their aim is computational science: the production of scientific results in their respective field using computers. Almost none of them has knowledge of software engineering issues.

68 Introduction to GPU/Parallel Computing www.prace-ri.eu The HPC community

Victor R. Basili, Jeffrey C. Carver, Daniela Cruzes, Lorin M. Hochstein, Jeffrey K. Hollingsworth, Forrest Shull, Marvin V. Zelkowitz, "Understanding the High-Performance-Computing Community: A Software Engineer's Perspective", IEEE Software, Vol. 25(4), pp. 29-36, July/August 2008.

69 Introduction to GPU/Parallel Computing www.prace-ri.eu HPC Programmers thoughts

The aim is research in the respective field, not research in programming.
FLOPS do not measure how good the research in the respective field is; the quality of the research matters, not the performance of the applications.
Writing code that achieves good performance on an HPC system is only one component towards achieving the goals, not a goal in itself.
There is not always a willingness to achieve better performance, especially if this leads to problems in code maintenance.

70 Introduction to GPU/Parallel Computing www.prace-ri.eu Technologies

There is skepticism about the use of new technologies: C and Fortran are used traditionally, as are OpenMP and MPI. If we take the step to use a new technology and that technology does not gain wide acceptance, what happens? The key is for well-established and new technologies to coexist.
Sharing of computational resources: many users run their applications simultaneously on a single system; submission of jobs is performed through a batch system, which makes debugging difficult; remote access also causes difficulties.

71 Introduction to GPU/Parallel Computing www.prace-ri.eu Other approaches and issues

Achieving better performance: changing the method instead of optimizing the code. Verification of results: comparison with data acquired from experiments.

72 Introduction to GPU/Parallel Computing www.prace-ri.eu Developments in technology

Everything already mentioned is intensified by developments in technology. Different technologies have been used on HPC systems over time: vector machines, multiprocessors, multi-core, coprocessors, clusters, …

73 Introduction to GPU/Parallel Computing www.prace-ri.eu An example that leads to a shift in technology in HPC systems

(Figure: percentage of improvement per year and per core for processors and memory; the original plot shows rates of +52%, +25%, +20% and +7% in different periods. Copyright © 2011 Elsevier Inc. All rights reserved.)

“Memory wall”: the speed of memory improves at a much lower rate than the speed of processors.

74 Introduction to GPU/Parallel Computing www.prace-ri.eu CPU and GPU: A different approach in design

(Figure: side-by-side chip layouts. The CPU has a few latency-oriented cores, each with its own registers, control logic, SIMD unit and local cache. The GPU has many throughput-oriented compute units, each with registers, SIMD units, hardware threading and cache/local memory.)

75 Introduction to GPU/Parallel Computing www.prace-ri.eu Latency vs. Throughput

Latency: the time between the initiation of a task and the moment its results become available.
Throughput: the number of tasks that complete per unit of time.

CPU: low latency, low throughput. GPU: high latency, high throughput.

76 Introduction to GPU/Parallel Computing www.prace-ri.eu CPU: Designed for high latency accesses to the memory

Large cache memories: they convert slow accesses to main memory into fast accesses to the cache memory.
Sophisticated flow control: branch prediction to reduce delays; data forwarding within the pipeline to reduce delays.
Powerful ALUs: faster execution per instruction.

(Figure: simplified CPU layout with control logic, a few ALUs, a large cache and DRAM.)

77 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU: Design for achieving high throughput

Small cache memories: mainly for instructions.
Simple flow control: no branch prediction, no data forwarding.
Power-efficient ALUs: many of them, with a longer execution time per instruction but many stages in the pipeline.
Requires a different approach to programming: a large number of threads is needed to hide the long latency of accesses to main memory.

(Figure: simplified GPU layout with many small ALUs and DRAM.)

78 Introduction to GPU/Parallel Computing www.prace-ri.eu The most efficient applications use both the CPU and the GPU

CPUs for serial execution of the parts of the program where latency is important: CPUs can be >10x faster than GPUs for serial code.
GPUs for parallel execution of the parts of the program where throughput is important: GPUs can be >10x faster than CPUs for parallel code.

79 Introduction to GPU/Parallel Computing www.prace-ri.eu Hybrid programming is becoming mainstream

Data Intensive Analytics, Scientific Simulation, Engineering Simulation, Medical Imaging, Financial Analysis, Electronic Design Automation, Digital Audio Processing, Digital Video Processing, Computer Vision, Biomedical Informatics, Ray Tracing Rendering, Statistical Modeling, Interactive Physics, Numerical Methods

80 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU Gems

A series of three books on highly efficient parallelization of applications for the GPU. There were 480 submissions to GPU Computing Gems; 150 articles have been included in the books. Freely available: https://developer.nvidia.com/gpugems/GPUGems/gpugems_pref01.html

81 Introduction to GPU/Parallel Computing www.prace-ri.eu A catch - Scalability

(Figure: execution time versus number of compute units (0-80) for two scalable applications, “Scalable 1” and “Scalable 2”, and one application that is not scalable.)

82 Introduction to GPU/Parallel Computing www.prace-ri.eu Complexity of algorithms

(Figure: number of operations versus data size (1000-9000) for algorithms of linear, n*log(n) and quadratic complexity.)

83 Introduction to GPU/Parallel Computing www.prace-ri.eu Scalability of data

Any algorithmic complexity other than linear is not scalable: execution time increases dramatically as the amount of data increases, even for algorithms of n*log(n) complexity. Parallel algorithms are meant to process large amounts of data. A serial algorithm with linear complexity will eventually be faster than a parallel algorithm with n*log(n) complexity: the log(n) term grows faster than the amount of parallelism that an HPC system can provide, so for large enough data the parallel algorithm becomes slower than the serial one, as sketched below.
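A rough way to see this (an illustrative argument added here, with P the number of compute units and c1, c2 constant per-operation costs):

\[
T_{serial}(n) = c_1\, n,
\qquad
T_{parallel}(n) = c_2\, \frac{n \log n}{P},
\qquad
T_{parallel}(n) < T_{serial}(n) \iff \log n < \frac{c_1}{c_2}\, P .
\]

For a fixed machine, P is bounded, while log n keeps growing with the problem size, so beyond some n the linear serial algorithm wins.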

84 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU architecture

85 Introduction to GPU/Parallel Computing www.prace-ri.eu Design principles

GPUs have different design principles than CPUs: they favor throughput instead of latency.

What are the architectural characteristics that contribute towards this design philosophy?

86 Introduction to GPU/Parallel Computing www.prace-ri.eu Streaming Processor (SP)

The basic computational unit, also known as a “CUDA core”. Composed of 1 Arithmetic Logic Unit (ALU) and 1 Floating Point Unit (FPU), both fully pipelined, single-issue, in-order. It does not include fetch, decode and dispatch units; multiple SPs share these units at the next level.

87 Introduction to GPU/Parallel Computing www.prace-ri.eu Streaming Multiprocessor (SM)

Composed of a number of SPs, supported by additional computational and management units. The number of SPs that constitute an SM differs among generations of GPUs:

Generation / Name of GPU: SPs per SM
1st generation / Tesla: 8
2nd generation / Fermi: 32
3rd generation / Kepler: 192
4th generation / Maxwell: 128
5th generation / Pascal: 64
6th generation / Volta: 64

88 Introduction to GPU/Parallel Computing www.prace-ri.eu Tesla SM

Multithreaded instruction unit: G80 up to 768 active threads, GT200 up to 1024 active threads; scheduling of threads in hardware
8 SPs: IEEE 754 32-bit floating point; 32-bit and 64-bit integer arithmetic
16K 32-bit registers, shared among the SPs
2 SFUs (Special Function Units): sin, cos, log, exp
16KB shared memory; GT200: 48KB
GT200 only: 1 DPU (Double Precision Unit): IEEE 754 64-bit floating point, fused multiply-add assembly instruction

89 Introduction to GPU/Parallel Computing www.prace-ri.eu Thread execution

Threads created in an application that will execute on a GPU are divided into groups called “warps”. All GPU generations so far have a warp size of 32 threads; however, the size of a warp could change in future GPU generations. All threads in the same warp execute the same instruction (typically on different data): the SIMT (Single Instruction, Multiple Threads) model of execution, illustrated below.
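As a small illustration (a sketch added here, not part of the original slides), consider the following CUDA kernel; the inner branch makes the two halves of each warp follow different paths, so the hardware executes the two paths one after the other (warp divergence), with part of the warp idle each time:

__global__ void simtExample(float *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    if (i < n) {
        /* All 32 threads of a warp issue the same instruction each cycle.  */
        /* This condition splits every warp: threads 0-15 of the warp take  */
        /* one branch and threads 16-31 the other, so the two branches are  */
        /* executed serially, one after the other.                          */
        if ((threadIdx.x % 32) < 16) {
            data[i] = data[i] * 2.0f;
        } else {
            data[i] = data[i] + 1.0f;
        }
    }
}

Code in which all threads of a warp take the same branch avoids this serialization.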

90 Introduction to GPU/Parallel Computing www.prace-ri.eu Problem

How are the 32 threads of a warp executed on only 8 SPs? The threads of a warp are further divided into 4 smaller groups of 8 threads each; 4 clock cycles are required for the current instruction to be executed by all threads of the warp.

91 Introduction to GPU/Parallel Computing www.prace-ri.eu Kepler SM (SMX) - ARIS

Up to 2048 active threads
192 SPs: IEEE 754 32-bit floating point; 32-bit and 64-bit integer arithmetic
64K or 128K 32-bit registers, shared among the SPs
4 warp schedulers
32 Load/Store units: allow the calculation of 32 memory addresses for reading/writing data
32 SFUs (Special Function Units)
48KB or 112KB shared memory
64 DPUs (Double Precision Units): IEEE 754 64-bit floating point, fused multiply-add assembly instruction

92 Introduction to GPU/Parallel Computing www.prace-ri.eu Thread execution

192 SPs / 32 = 6 warps, but there are only 4 warp schedulers; how are all SPs used? Each warp scheduler is dual issue: for each warp that is selected for execution, 2 instructions are fetched. They can be executed at the same time, as long as there are no data dependencies; if there are data dependencies, they are executed serially. The other functional units are used simultaneously with the SPs, so 8 warps can truly execute in parallel.

93 Introduction to GPU/Parallel Computing www.prace-ri.eu Volta SM - Latest

Up to 2048 active threads
64 SPs, grouped into 4 processing blocks; all resources are equally shared among the processing blocks
64K 32-bit registers
4 warp schedulers, with 1 dispatch unit per warp scheduler
16 Load/Store units: allow the calculation of 16 memory addresses for reading/writing data
32 SFUs
8 Tensor Cores, targeted towards Deep Learning
Up to 96KB shared memory (configurable)

94 Introduction to GPU/Parallel Computing www.prace-ri.eu Texture/Processor Cluster (TPC)

Multiple SMs are grouped into TPCs. The number of SMs that constitute a TPC depends on the specific GPU. In more recent architectures (Maxwell, Pascal) TPCs have been replaced by Graphics Processing Clusters (GPCs). They also contain other shared resources.

95 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU

Multiple TPCs are placed on a chip and constitute a GPU. G80 (Tesla): 8 TPCs, a total of 128 SPs.

96 Introduction to GPU/Parallel Computing www.prace-ri.eu GT200 (Tesla)

10 TPC * 3 SM/TPC * 8 SP/SM = 240 SPs

97 Introduction to GPU/Parallel Computing www.prace-ri.eu GK100 (Kepler)

13 SMs * 192 SP/SM = 2496 SPs

98 Introduction to GPU/Parallel Computing www.prace-ri.eu GV100 (Volta)

6 GPC * 14 SM/GPC * 64 SP/SM = 5376 SPs

99 Introduction to GPU/Parallel Computing www.prace-ri.eu Memory hierarchy

At the architectural level: registers, shared memory, constant or read-only memory, L1 and L2 cache memory, global memory. We will discuss later the view of the memory hierarchy from the programmer's point of view.

100 Introduction to GPU/Parallel Computing www.prace-ri.eu Characteristics of the memory hierarchy

Registers: access time 1 clock cycle; accessible by only 1 thread; life span: the computational kernel.
Shared memory: access time 30-50 clock cycles; accessible by all threads in the same block; life span: the computational kernel.
Constant memory: access time 1-7 clock cycles (cache hit) or 80-2750 (cache miss); accessible by all threads of a computational kernel; life span: the application (preserved across calls to computational kernels).
Global memory: access time 80-2750 clock cycles; accessible by all threads of a computational kernel; life span: the application (preserved across calls to computational kernels).

“Dissecting GPU Memory Hierarchy through Microbenchmarking”, Xinxin Mei, Xiaowen Chu, https://arxiv.org/abs/1509.02308, Last accessed 04/10/2016

101 Introduction to GPU/Parallel Computing www.prace-ri.eu Interconnection with the host

Typically through a PCI Express (PCIe) bus. NVIDIA developed NVLink: 5 to 12× faster transfer speed between the CPU and the GPU. It allows data transfer to/from the GPU at almost the same speed at which the main memory of the GPU is accessed. Available in the Pascal and Volta architectures.

102 Introduction to GPU/Parallel Computing www.prace-ri.eu Dynamic Parallelism

Supported starting with the Kepler architecture. Nested parallelism: code executing on the GPU can start the execution of parallel code that will also execute on the GPU, with synchronization for creating/using the results. Scheduling happens at the hardware level; the CPU is not involved. Example use: dynamic adaptation of a grid during a simulation. A sketch follows below.
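A minimal sketch of what this looks like in CUDA (added here for illustration; the kernel names are made up, error checking is omitted, and it assumes a device of compute capability 3.5 or higher with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true; note that device-side cudaDeviceSynchronize() has been deprecated in recent CUDA versions):

__global__ void childKernel(float *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        data[i] *= 2.0f;   /* some per-element work */
    }
}

__global__ void parentKernel(float *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        /* Launch a child grid directly from GPU code; the CPU is not involved */
        childKernel<<<(n + 255) / 256, 256>>>(data, n);

        /* Wait for the child grid to finish before its results are used */
        cudaDeviceSynchronize();
    }
}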

103 Introduction to GPU/Parallel Computing www.prace-ri.eu The concept of Compute Capability

104 Introduction to GPU/Parallel Computing www.prace-ri.eu What is Compute Capability?

It is a “version number” for the hardware of a GPU and indicates its computational capabilities. It does not indicate processing speed; however, newer GPUs with a higher compute capability are usually also faster, due to their improved design. A program can query it at run time, as sketched below.
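A small sketch of how a program can read the compute capability through the CUDA runtime API (added here for illustration; error checking omitted):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        /* prop.major and prop.minor form the compute capability, e.g. 3.5 */
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }

    return 0;
}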

105 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU generations

GPUs with the same major revision number are based on the same architecture. The minor revision number indicates gradual improvements of the basic architecture, possibly with the addition of new features.

Compute capability: Microarchitecture
1.x: Tesla
2.x: Fermi
3.x: Kepler
5.x: Maxwell
6.x: Pascal
7.x: Volta

106 Introduction to GPU/Parallel Computing www.prace-ri.eu Feature support

Compute capability Feature support 1.0 1.1 1.2 1.3 2.x 3.0 3.5 3.7 5.0 5.2 6.0 6.1 Integer atomic functions operating on 32-bit words in global memory No Yes atomicExch() operating on 32-bit floating point values in global memory Integer atomic functions operating on 32-bit words in shared memory atomicExch() operating on 32-bit floating point values in shared memory No Yes Integer atomic functions operating on 64-bit words in global memory Warp vote functions Double-precision floating-point operations No Yes Atomic functions operating on 64-bit integer values in shared memory Floating-point atomic addition operating on 32-bit words in global and shared memory _ballot() _threadfence_system() No Yes _syncthreads_count(), _syncthreads_and(), _syncthreads_or() Surface functions 3D grid of thread block Warp shuffle functions No Yes Funnel shift No Yes Dynamic107 parallelismIntroduction to GPU/Parallel Computing www.prace-ri.eu Technical specifications (1/3) Compute capability Technical specifications 1.0 1.1 1.2 1.3 2.x 3.0 3.5 3.7 5.0 5.2 5.3 Maximum dimensionality of grid of thread blocks 2 3 Maximum x-dimension of a grid of thread blocks 65535 231 − 1 Maximum y-, or z-dimension of a grid of thread blocks 65535 Maximum dimensionality of thread block 3 Maximum x- or y -dimension of a block 512 1024 Maximum z-dimension of a block 64 Maximum number of threads per block 512 1024 Warp size 32 Maximum number of resident blocks per multiprocessor 8 16 32 Maximum number of resident warps per multiprocessor 24 32 48 64 Maximum number of resident threads per multiprocessor 768 1024 1536 2048 Number of 32-bit registers per multiprocessor 8 K 16 K 32 K 64 K 128 K 64 K Maximum number of 32-bit registers per thread block N/A 32 K 64 K 32 K Maximum number of 32-bit registers per thread 124 63 255 Maximum amount of shared memory per multiprocessor 16 KB 48 KB 112 KB 64 KB 96 KB 64 KB Number of shared memory banks 16 32 Amount of local memory per thread 16 KB 512 KB Constant memory size 64 KB Cache working set per multiprocessor for constant memory 8 KB 10 KB

Cache working set per multiprocessor for texture memory 6 – 8 KB 12 KB 12 – 48 KB 24 KB 48 KB N/A

108 Introduction to GPU/Parallel Computing www.prace-ri.eu Technical specifications (2/3)

Compute capability Technical specifications 1.0 1.1 1.2 1.3 2.x 3.0 3.5 3.7 5.0 5.2 5.3 Maximum width for 1D texture reference bound to a CUDA 8192 65536 array Maximum width for 1D texture reference bound to linear 227 memory Maximum width and number of layers for a 1D layered 8192 × 512 16384 × 2048 texture reference Maximum width and height for 2D texture reference bound 65536 × 32768 655362 to a CUDA array Maximum width and height for 2D texture reference bound 650002 to a linear memory Maximum width and height for 2D texture reference bound N/A 163842 to a CUDA array supporting texture gather Maximum width, height, and number of layers for a 2D 8192 × 8192 × 512 16384 × 16384 × 2048 layered texture reference Maximum width, height and depth for a 3D texture 20483 40963 reference bound to linear memory or a CUDA array Maximum width and number of layers for a cubemap N/A 16384 × 2046 layered texture reference Maximum number of textures that can be bound to a kernel 128 256

109 Introduction to GPU/Parallel Computing www.prace-ri.eu Technical specifications (3/3)

Compute capability Technical specifications 1.0 1.1 1.2 1.3 2.x 3.0 3.5 3.7 5.0 5.2 5.3 Maximum width for a 1D surface reference bound to a 65536 CUDA array Maximum width and number of layers for a 1D layered 65536 × 2048 surface reference Maximum width and height for a 2D surface reference 65536 × 32768 bound to a CUDA array Maximum width, height, and number of layers for a 2D Not 65536 × 32768 × 2048 layered surface reference supported Maximum width, height, and depth for a 3D surface 65536 × 32768 × 2048 reference bound to a CUDA array Maximum width and number of layers for a cubemap 32768 × 2046 layered surface reference Maximum number of surfaces that can be bound to a kernel 8 16 Maximum number of instructions per kernel 2 million 512 million

110 Introduction to GPU/Parallel Computing www.prace-ri.eu Architecture specifications

Compute capability (version) Architecture specifications 1.0 1.1 1.2 1.3 2.0 2.1 3.0 3.5 3.7 5.0 5.2 6.0 6.1

Number of ALU lanes for integer and single-precision 8 32 48 192 128 64 128 floating-point arithmetic operations

Number of special function units for single-precision 2 4 8 32 16 32 floating-point transcendental functions

Number of texture filtering units for every 2 4 8 16 8 texture address unit or render output unit (ROP)

Number of warp schedulers 1 2 4 2 4

Number of instructions issued at once by scheduler 1 2

111 Introduction to GPU/Parallel Computing www.prace-ri.eu THANK YOU FOR YOUR ATTENTION

www.prace-ri.eu

112 Introduction to GPU/Parallel Computing www.prace-ri.eu