Venetis University of Patras

Introduction to GPU/Parallel Computing
Ioannis E. Venetis
University of Patras
www.prace-ri.eu

Introduction to High Performance Systems

Wait, what?
▶ Aren’t we here to talk about GPUs?
  ▶ And how to program them with CUDA?
▶ Yes, but we need to understand their place and their purpose in modern High Performance Systems
  ▶ This will make it clear when it is beneficial to use them

Top 500 (June 2017)

Rank  System             CPU Cores   Accel. Cores  Rmax (TFlop/s)  Rpeak (TFlop/s)  Power (kW)
1     Sunway TaihuLight  10,649,600  -             93,014.6        125,435.9        15,371
2     Tianhe-2           3,120,000   2,736,000     33,862.7        54,902.4         17,808
3     Piz Daint          361,760     297,920       19,590.0        25,326.3         2,272
4     Titan              560,640     261,632       17,590.0        27,112.5         8,209
5     Sequoia            1,572,864   -             17,173.2        20,132.7         7,890

Sites and system details:
1. National Supercomputing Center in Wuxi, China. Sunway TaihuLight - Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway. NRCPC
2. National Super Computer Center in Guangzhou, China. Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P. NUDT
3. Swiss National Supercomputing Centre (CSCS). Piz Daint - Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100. Cray Inc.
4. DOE/SC/Oak Ridge National Laboratory, United States. Titan - Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x. Cray Inc.
5. DOE/NNSA/LLNL, United States. Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom. IBM

How do we build an HPC system?
▶ Limitations in technology
  ▶ It is impossible to fit all the computational resources we require into a single chip
▶ We have to build our system hierarchically

Processor
▶ All modern processors are “multi-core”
  ▶ Multiple, independent processors are placed on the same chip
▶ They might also support Simultaneous Multi-Threading (SMT)
  ▶ Every core is capable of executing multiple flows of instructions (threads)
  ▶ However, these share most of the functional units of each core
▶ 1st level of parallelism
▶ Typically 4, 8 or 16 cores

Compute card
▶ 1 or more processors are placed on a compute card
▶ Typically, a single compute card operates as a shared memory system
▶ It usually contains 1, 2 or 4 processors

Node
▶ Multiple compute cards are placed in a node
▶ There is no shared memory among compute cards
▶ The interconnection network among compute cards can be implemented in many different ways
▶ Usually there are 1 or more additional compute cards dedicated to managing communication with the rest of the nodes

Rack
▶ Multiple nodes are placed in a rack
▶ There is no shared memory among the nodes of a rack
▶ The interconnection network among nodes can be implemented in many different ways
  ▶ Not necessarily in the same way that compute cards are connected within a single node

The whole system
▶ Multiple racks are connected
▶ Typically there are dedicated nodes that handle I/O

Hierarchical parallelism
[Figure: IBM BlueGene/P]

Examples of modern High Performance Systems

Sunway TaihuLight (No 1, Top 500 list, June 2017)
▶ Computing node
  ▶ Basic element of the architecture
▶ 256 computing nodes create a super node
▶ Super nodes are connected through the central switch network

Sources of images:
• The Sunway TaihuLight supercomputer: system and applications. Fu, H., Liao, J., Yang, J. et al. Sci. China Inf. Sci. (2016) 59: 072001. doi:10.1007/s11432-016-5588-7
• Report on the Sunway TaihuLight System. Dongarra, J., Tech Report UT-EECS-16-742, June 2016.

Processor
▶ SW26010
▶ One of the few systems that rely on a custom made processor
  ▶ Designed by the Shanghai High Performance IC Design Center
▶ Characteristic example of a heterogeneous many-core processor
  ▶ Composed of 2 different types of cores

Processor
▶ Contains 4 Core Groups (CGs)
  ▶ Connected through a Network on Chip (NoC)
▶ Each CG is composed of:
  ▶ 1 Management Processing Element (MPE)
  ▶ 64 Computing Processing Elements (CPEs)
    ▶ Placed on an 8x8 grid

Processor
▶ Each CG has a distinct address space
  ▶ Connected to the MPE and the CPEs through a Memory Controller (MC)
▶ Each processor connects to the rest of the system through the System Interface (SI)

The two types of cores
▶ Management Processing Element (MPE)
  ▶ Complete 64-bit RISC core
  ▶ Executes instructions in user and system modes, handles interrupts, memory management, superscalar, out-of-order execution, …
  ▶ Performs all management and communication tasks
▶ Computing Processing Element (CPE)
  ▶ Reduced capability 64-bit RISC core
  ▶ Executes instructions only in user mode, does not handle interrupts, …
  ▶ Objectives of the design
    ▶ Maximum overall performance
    ▶ Reduced design complexity
  ▶ Placed on an 8x8 grid
    ▶ Allows for fast exchange of data directly between registers

Compute card
▶ 2 processors
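The core group structure described above fixes the number of cores in one SW26010. The following is a minimal sketch of that arithmetic in Python. The constant and function names are illustrative assumptions and do not come from the slides; the counts themselves (4 CGs, 1 MPE and an 8x8 grid of CPEs per CG, 2 processors per compute card) are the ones stated above.

```python
# Counts taken from the slides; all names here are illustrative assumptions.
CORE_GROUPS_PER_PROCESSOR = 4      # CGs in one SW26010
MPES_PER_CORE_GROUP = 1            # Management Processing Elements per CG
CPE_GRID_ROWS = 8                  # CPEs are placed on an 8x8 grid
CPE_GRID_COLS = 8
PROCESSORS_PER_COMPUTE_CARD = 2


def cores_per_processor() -> int:
    """Total cores (MPEs + CPEs) in one SW26010 processor."""
    cpes_per_core_group = CPE_GRID_ROWS * CPE_GRID_COLS
    return CORE_GROUPS_PER_PROCESSOR * (MPES_PER_CORE_GROUP + cpes_per_core_group)


def cores_per_compute_card() -> int:
    """Total cores on one compute card."""
    return PROCESSORS_PER_COMPUTE_CARD * cores_per_processor()


if __name__ == "__main__":
    print(cores_per_processor())     # 4 * (1 + 64) = 260
    print(cores_per_compute_card())  # 2 * 260 = 520
```

The per-processor result, 260, agrees with the "260C" in the Top 500 entry for the SW26010.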
Node
▶ 4 compute cards
  ▶ 2 on each side

Supernode
▶ 32 nodes (256 processors)

Cabinet
▶ 4 supernodes (1024 processors)

Sunway TaihuLight
▶ 40 cabinets

Overview
Cores                                             10,649,600
Peak performance                                  125.436 PFlops
Linpack performance                               93.015 PFlops
CPU frequency                                     1.45 GHz
Peak performance of a CPU                         3.06 TFlops
Total memory                                      1310.72 TB
Total memory bandwidth                            5591.5 TB/s
Network link bandwidth                            16 GB/s
Network bisection bandwidth                       70 TB/s
Network diameter                                  7
Total storage                                     20 PB
Total I/O bandwidth                               288 GB/s
Power consumption when running the Linpack test   15.371 MW
Performance power ratio                           6.05 GFlops/W

Tianhe-2 (No 2, Top 500 list, June 2017)
▶ In contrast to Sunway TaihuLight it has typical/commercial processors
  ▶ Intel Xeon E5-2692
  ▶ 12 cores
  ▶ 2.2 GHz
▶ To achieve high performance it uses coprocessors
  ▶ Intel Xeon Phi 31S1P
  ▶ 57 cores
  ▶ 4-way SMT
  ▶ 1.1 GHz
  ▶ PCI-E 2.0 interconnect with the host system

Compute card
▶ Contains 2 processors and 3 Xeon Phi

Node
▶ Contains 2 compute cards
▶ Special interconnection

Frame
▶ 16 nodes

Rack
▶ 4 frames

Tianhe-2
▶ 125 racks

Overview
Cores                                    3,120,000
Peak performance                         54.902 PFlops
Linpack performance                      33.863 PFlops
CPU frequency                            2.2 GHz / 1.1 GHz
Total memory                             1,404 TB
Total storage                            12.4 PB
Total I/O bandwidth                      100 GB/s
Power consumption when running Linpack   17.808 MW
Performance power ratio                  1.9 GFlops/W

Titan (No 4, Top 500 list, June 2017)
▶ Also consists of typical/commercial processors
  ▶ AMD Opteron 6274
  ▶ 16 cores
  ▶ 2.2 GHz
▶ To achieve high performance it uses coprocessors
  ▶ NVIDIA K20x
  ▶ 2688 cores
  ▶ 732 MHz
  ▶ PCI-E 2.0 interconnect with the host system

Compute card / Node
▶ Contains 1 processor + 1 GPU
▶ 2 nodes share the router of the interconnection network
[Figure: interconnection network with X, Y and Z axes]

Blade / Cabinet
▶ Each blade contains 4 nodes
▶ Each cabinet contains 24 blades

Titan
▶ 200 cabinets

Overview
Cores                                    560,640
Peak performance                         27.113 PFlops
Linpack performance                      17.590 PFlops
CPU frequency                            2.2 GHz / 2.2 GHz
Total memory                             710 TB
Total storage                            40 PB
Total I/O bandwidth                      1.4 TB/s
Power consumption when running Linpack   8.209 MW
Performance power ratio                  2.1 GFlops/W

Comparison
                                Sunway TaihuLight   Tianhe-2            Titan
Cores                           10,649,600          3,120,000           560,640
Peak performance                125.436 PFlops      54.902 PFlops       27.113 PFlops
Linpack performance             93.015 PFlops       33.863 PFlops       17.590 PFlops
CPU frequency                   1.45 GHz            2.2 GHz / 1.1 GHz   2.2 GHz / 2.2 GHz
Total memory                    1310.72 TB          1,404 TB            710 TB
Total storage                   20 PB               12.4 PB             40 PB
Total I/O bandwidth             288 GB/s            100 GB/s            1.4 TB/s
Power consumption for Linpack   15.371 MW           17.808 MW           8.209 MW
Performance power ratio         6.05 GFlops/W       1.9 GFlops/W        2.1 GFlops/W

Power consumption
▶ Average daily energy consumption per household: 11 kWh
  ▶ http://www.cres.gr/pepesec/apotelesmata.html
  ▶ Small study, but gives a picture
▶ Tianhe-2: 17,808 kW * 24 hours = 427,392 kWh
▶ Consumes as much as 38,854 households per day!
▶ If on average 3 people live in a household:
  ▶ 38,854 * 3
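The household comparison above is simple arithmetic, and the same figures also reproduce the performance power ratio quoted for Tianhe-2. The following Python sketch works through both. The function and constant names are illustrative assumptions and do not come from the slides; the input figures (11 kWh per household per day, 17,808 kW, Rmax of 33,862.7 TFlop/s) are the ones given in the slides.

```python
# Figures taken from the slides; all names here are illustrative assumptions.
HOUSEHOLD_KWH_PER_DAY = 11        # average daily consumption per household
TIANHE2_POWER_KW = 17_808         # power draw when running Linpack
TIANHE2_RMAX_TFLOPS = 33_862.7    # Linpack performance (Rmax)


def daily_energy_kwh(power_kw: float, hours: float = 24) -> float:
    """Energy used over the given number of hours at constant power."""
    return power_kw * hours


def equivalent_households(power_kw: float,
                          household_kwh_per_day: float = HOUSEHOLD_KWH_PER_DAY) -> int:
    """Number of households whose daily consumption matches the system's."""
    return round(daily_energy_kwh(power_kw) / household_kwh_per_day)


def gflops_per_watt(rmax_tflops: float, power_kw: float) -> float:
    """Performance power ratio.

    TFlop/s -> GFlop/s and kW -> W are both a factor of 1000,
    so the two conversions cancel.
    """
    return rmax_tflops / power_kw


if __name__ == "__main__":
    print(daily_energy_kwh(TIANHE2_POWER_KW))        # 427392 kWh
    print(equivalent_households(TIANHE2_POWER_KW))   # 38854 households
    print(round(gflops_per_watt(TIANHE2_RMAX_TFLOPS, TIANHE2_POWER_KW), 1))  # 1.9
```

Passing the Rmax and power figures of Sunway TaihuLight or Titan to `gflops_per_watt` allows the same check for the other two systems in the comparison.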
