Venetis University of Patras

Introduction to GPU/Parallel Computing
Ioannis E. Venetis
University of Patras
www.prace-ri.eu

Introduction to High Performance Systems

Wait, what?
▶ Aren’t we here to talk about GPUs?
▶ And how to program them with CUDA?
▶ Yes, but we need to understand their place and their purpose in modern High Performance Systems
  ▶ This will make it clear when it is beneficial to use them

Top 500 (June 2017)

Rank 1: National Supercomputing Center in Wuxi, China
  System: Sunway TaihuLight - Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway, NRCPC
  CPU cores: 10,649,600   Accel. cores: -
  Rmax: 93,014.6 TFlop/s   Rpeak: 125,435.9 TFlop/s   Power: 15,371 kW

Rank 2: National Super Computer Center in Guangzhou, China
  System: Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P, NUDT
  CPU cores: 3,120,000   Accel. cores: 2,736,000
  Rmax: 33,862.7 TFlop/s   Rpeak: 54,902.4 TFlop/s   Power: 17,808 kW

Rank 3: Swiss National Supercomputing Centre (CSCS)
  System: Piz Daint - Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100, Cray Inc.
  CPU cores: 361,760   Accel. cores: 297,920
  Rmax: 19,590.0 TFlop/s   Rpeak: 25,326.3 TFlop/s   Power: 2,272 kW

Rank 4: DOE/SC/Oak Ridge National Laboratory, United States
  System: Titan - Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x, Cray Inc.
  CPU cores: 560,640   Accel. cores: 261,632
  Rmax: 17,590.0 TFlop/s   Rpeak: 27,112.5 TFlop/s   Power: 8,209 kW

Rank 5: DOE/NNSA/LLNL, United States
  System: Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom, IBM
  CPU cores: 1,572,864   Accel. cores: -
  Rmax: 17,173.2 TFlop/s   Rpeak: 20,132.7 TFlop/s   Power: 7,890 kW

How do we build an HPC system?
▶ Limitations in technology
  ▶ It is impossible to fit all the computational resources we require into a single chip
▶ We have to build our system hierarchically

Processor
▶ All modern processors are “multi-core”
  ▶ Multiple, independent processors are placed on the same chip
▶ They might also support Simultaneous Multi-Threading (SMT)
  ▶ Every core is capable of executing multiple instruction streams (threads)
  ▶ However, these threads share most of the functional units of each core
▶ 1st level of parallelism
▶ Typically 4, 8 or 16 cores

Compute card
▶ 1 or more processors are placed on a compute card
▶ Typically, a single compute card operates as a shared memory system
▶ It usually contains 1, 2 or 4 processors

Node
▶ Multiple compute cards are placed in a node
▶ There is no shared memory among compute cards
▶ The interconnection network among compute cards can be implemented in many different ways
▶ Usually 1 or more additional compute cards are dedicated to managing communication with the rest of the nodes

Rack
▶ Multiple nodes are placed in a rack
▶ There is no shared memory among the nodes of a rack
▶ The interconnection network among nodes can be implemented in many different ways
  ▶ Not necessarily in the same way that compute cards are connected within a single node

The whole system
▶ Multiple racks are connected
▶ Typically there are dedicated nodes that handle I/O

Hierarchical parallelism
[Figure: hierarchical parallelism, IBM BlueGene/P]

Examples of modern High Performance Systems

Sunway TaihuLight (No 1, Top 500 list, June 2017)
▶ Computing node
  ▶ Basic element of the architecture
▶ 256 computing nodes create a super node
▶ Super nodes are connected through the central switch network

Sources of images:
• The Sunway TaihuLight supercomputer: system and applications. Fu, H., Liao, J., Yang, J. et al. Sci. China Inf. Sci. (2016) 59: 072001. doi:10.1007/s11432-016-5588-7
• Report on the Sunway TaihuLight System. Dongarra, J., Tech Report UT-EECS-16-742, June 2016.

Processor
▶ SW26010
▶ One of the few systems that rely on a custom-made processor
▶ Designed by the Shanghai High Performance IC Design Center
▶ Characteristic example of a heterogeneous many-core processor
  ▶ Composed of 2 different types of cores

Processor
▶ Contains 4 Core Groups (CGs)
  ▶ Connected through a Network on Chip (NoC)
▶ Each CG is composed of:
  ▶ 1 Management Processing Element (MPE)
  ▶ 64 Computing Processing Elements (CPEs)
    ▶ Placed on an 8x8 grid

Processor
▶ Each CG has a distinct memory address space
  ▶ The memory is connected to the MPE and the CPEs through a Memory Controller (MC)
▶ Each processor connects to the rest of the system through the System Interface (SI)

The two types of cores
▶ Management Processing Element (MPE)
  ▶ Complete 64-bit RISC core
  ▶ Executes instructions in user and system modes, handles interrupts, memory management, superscalar, out-of-order execution, …
  ▶ Performs all management and communication tasks
▶ Computing Processing Element (CPE)
  ▶ Reduced-capability 64-bit RISC core
  ▶ Executes instructions only in user mode, does not handle interrupts, …
  ▶ Objectives of the design
    ▶ Maximum overall performance
    ▶ Reduced design complexity
  ▶ Placed on an 8x8 grid
    ▶ Allows for fast exchange of data directly between registers

Compute card
▶ 2 processors
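
As a quick sanity check, the core counts on the Processor slides can be multiplied out. The sketch below assumes that both the MPE and the CPEs of every Core Group are counted as cores; the constant and function names are made up for illustration and do not come from the slides. Under that assumption one SW26010 has 260 cores, which agrees with the "260C" in the Top 500 entry, and dividing the total core count from that entry by 260 gives the number of processors in the system.

```python
# Illustrative arithmetic only; all names here are hypothetical.
CORE_GROUPS_PER_PROCESSOR = 4    # "Contains 4 Core Groups (CGs)"
MPES_PER_CORE_GROUP = 1          # 1 Management Processing Element per CG
CPES_PER_CORE_GROUP = 8 * 8      # 64 CPEs placed on an 8x8 grid
PROCESSORS_PER_COMPUTE_CARD = 2  # "Compute card: 2 processors"
TOTAL_CORES = 10_649_600         # Sunway TaihuLight, Top 500 list of June 2017


def cores_per_processor() -> int:
    """Cores in one SW26010, assuming MPEs and CPEs are both counted."""
    return CORE_GROUPS_PER_PROCESSOR * (MPES_PER_CORE_GROUP + CPES_PER_CORE_GROUP)


def cores_per_compute_card() -> int:
    """Cores on one compute card (2 processors)."""
    return PROCESSORS_PER_COMPUTE_CARD * cores_per_processor()


def processors_in_system() -> int:
    """Processors implied by the total core count; it must divide evenly."""
    processors, remainder = divmod(TOTAL_CORES, cores_per_processor())
    if remainder:
        raise ValueError("total core count is not a multiple of cores per processor")
    return processors


if __name__ == "__main__":
    print("cores per processor:   ", cores_per_processor())     # 260
    print("cores per compute card:", cores_per_compute_card())  # 520
    print("processors in system:  ", processors_in_system())    # 40960
```

The even division is a useful consistency check: if the per-processor count were wrong, `processors_in_system` would raise instead of returning a plausible-looking number.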
Node
▶ 4 compute cards
  ▶ 2 on each side

Supernode
▶ 32 nodes (256 processors)

Cabinet
▶ 4 supernodes (1024 processors)

Sunway TaihuLight
▶ 40 cabinets

Overview
Cores                                            10,649,600
Peak performance                                 125.436 PFlops
Linpack performance                              93.015 PFlops
CPU frequency                                    1.45 GHz
Peak performance of a CPU                        3.06 TFlops
Total memory                                     1,310.72 TB
Total memory bandwidth                           5,591.5 TB/s
Network link bandwidth                           16 GB/s
Network bisection bandwidth                      70 TB/s
Network diameter                                 7
Total storage                                    20 PB
Total I/O bandwidth                              288 GB/s
Power consumption when running the Linpack test  15.371 MW
Performance power ratio                          6.05 GFlops/W

Tianhe-2 (No 2, Top 500 list, June 2017)
▶ In contrast to Sunway TaihuLight, it uses standard commercial processors
  ▶ Intel Xeon E5-2692
    ▶ 12 cores
    ▶ 2.2 GHz
▶ To achieve high performance it uses coprocessors
  ▶ Intel Xeon Phi 31S1P
    ▶ 57 cores
    ▶ 4-way SMT
    ▶ 1.1 GHz
    ▶ PCI-E 2.0 interconnect with the host system

Compute card
▶ Contains 2 processors and 3 Xeon Phi coprocessors

Node
▶ Contains 2 compute cards
▶ Special interconnection

Frame
▶ 16 nodes

Rack
▶ 4 frames

Tianhe-2
▶ 125 racks

Overview
Cores                                   3,120,000
Peak performance                        54.902 PFlops
Linpack performance                     33.863 PFlops
CPU frequency                           2.2 GHz / 1.1 GHz
Total memory                            1,404 TB
Total storage                           12.4 PB
Total I/O bandwidth                     100 GB/s
Power consumption when running Linpack  17.808 MW
Performance power ratio                 1.9 GFlops/W

Titan (No 4, Top 500 list, June 2017)
▶ Also consists of standard commercial processors
  ▶ AMD Opteron 6274
    ▶ 16 cores
    ▶ 2.2 GHz
▶ To achieve high performance it uses coprocessors
  ▶ NVIDIA K20x
    ▶ 2688 cores
    ▶ 732 MHz
    ▶ PCI-E 2.0 interconnect with the host system

Compute card / Node
▶ Contains 1 processor + 1 GPU
▶ 2 nodes share the router of the interconnection network
[Figure: interconnection network with X, Y and Z axes]

Blade / Cabinet
▶ Each blade contains 4 nodes
▶ Each cabinet contains 24 blades

Titan
▶ 200 cabinets

Overview
Cores                                   560,640
Peak performance                        27.113 PFlops
Linpack performance                     17.590 PFlops
CPU frequency                           2.2 GHz / 732 MHz
Total memory                            710 TB
Total storage                           40 PB
Total I/O bandwidth                     1.4 TB/s
Power consumption when running Linpack  8.209 MW
Performance power ratio                 2.1 GFlops/W

Comparison
                               Sunway TaihuLight  Tianhe-2           Titan
Cores                          10,649,600         3,120,000          560,640
Peak performance               125.436 PFlops     54.902 PFlops      27.113 PFlops
Linpack performance            93.015 PFlops      33.863 PFlops      17.590 PFlops
CPU frequency                  1.45 GHz           2.2 GHz / 1.1 GHz  2.2 GHz / 732 MHz
Total memory                   1,310.72 TB        1,404 TB           710 TB
Total storage                  20 PB              12.4 PB            40 PB
Total I/O bandwidth            288 GB/s           100 GB/s           1.4 TB/s
Power consumption for Linpack  15.371 MW          17.808 MW          8.209 MW
Performance power ratio        6.05 GFlops/W      1.9 GFlops/W       2.1 GFlops/W

Power consumption
▶ Average daily electricity consumption per household: 11 kWh
  ▶ http://www.cres.gr/pepesec/apotelesmata.html
  ▶ Small study, but gives a picture
▶ Tianhe-2: 17,808 kW * 24 hours = 427,392 kWh
  ▶ Consumes as much as 38,854 households per day!
▶ If on average 3 people live in a household:
  ▶ 38,854 * 3
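
The household comparison is simple arithmetic, and the same inputs also reproduce the performance power ratios quoted in the comparison. The sketch below uses only numbers that appear on the slides: Linpack performance (Rmax) and power from the Top 500 entries, and the 11 kWh per household per day from the cited study. The function names are hypothetical, and treating the Linpack power draw as sustained for a full day is an assumption of the estimate, not a measurement.

```python
HOURS_PER_DAY = 24
HOUSEHOLD_KWH_PER_DAY = 11  # average quoted from the study cited on the slide

# (Rmax in TFlop/s, power in kW), taken from the Top 500 list of June 2017
SYSTEMS = {
    "Sunway TaihuLight": (93_014.6, 15_371),
    "Tianhe-2": (33_862.7, 17_808),
    "Titan": (17_590.0, 8_209),
}


def daily_energy_kwh(power_kw: float) -> float:
    """Energy used in one day, assuming the Linpack power draw is sustained."""
    return power_kw * HOURS_PER_DAY


def equivalent_households(power_kw: float) -> int:
    """Households whose average daily consumption adds up to the same energy."""
    return round(daily_energy_kwh(power_kw) / HOUSEHOLD_KWH_PER_DAY)


def gflops_per_watt(rmax_tflops: float, power_kw: float) -> float:
    """Performance power ratio in GFlops/W."""
    # TFlop/s -> GFlop/s is a factor of 1000 and kW -> W is a factor of 1000,
    # so the two conversion factors cancel.
    return rmax_tflops / power_kw


if __name__ == "__main__":
    for name, (rmax, power) in SYSTEMS.items():
        print(f"{name}: {daily_energy_kwh(power):,.0f} kWh/day, "
              f"about {equivalent_households(power):,} households, "
              f"{gflops_per_watt(rmax, power):.2f} GFlops/W")
```

Keeping the unit conversion in a single function makes it easy to see why dividing TFlop/s by kW yields GFlops/W directly.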
