rCUDA, an Approach to Provide Remote Access to GPUs
An approach to provide remote access to GPU computational power
Rafael Mayo Gual, University Jaume I, Spain. Joint research effort.
2011 HPC Advisory Council China Workshop

Outline
● GPU computing
● GPU computing scenarios
● Introduction to rCUDA
● rCUDA structure
● rCUDA functionality
● Basic TCP/IP version
● InfiniBand version
● Work in progress and near future

GPU computing
● GPU computing covers all the technological issues, hardware and software, involved in using the computational power of GPUs to execute general-purpose code.
● This leads to a heterogeneous system.
● GPU computing has grown rapidly in recent years.
● Nov 2008, top500 list: first supercomputer on the top500 using GPU computing (#29), at the Tokyo Institute of Technology.

Top500 list (June 2011):
1. RIKEN Advanced Institute for Computational Science (AICS), Japan: K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect / 2011 (Fujitsu)
2. National Supercomputing Center in Tianjin, China: Tianhe-1A, NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C / 2010 (NUDT)
3. DOE/SC/Oak Ridge National Laboratory, United States: Jaguar, Cray XT5-HE, Opteron 6-core 2.6 GHz / 2009 (Cray Inc.)
4. National Supercomputing Centre in Shenzhen (NSCS), China: Nebulae, Dawning TC3600 Blade, Intel X5650, NVIDIA Tesla C2050 GPU / 2010 (Dawning)
5. GSIC Center, Tokyo Institute of Technology, Japan: TSUBAME 2.0, HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows / 2010 (NEC/HP)

Green500 list (June 2011):
1. IBM Thomas J. Watson Research Center: NNSA/SC Blue Gene/Q Prototype 2
2. IBM Thomas J. Watson Research Center: NNSA/SC Blue Gene/Q Prototype 2
3. Nagasaki University: DEGIMA Cluster, Intel i5, ATI Radeon GPU, InfiniBand QDR
4. GSIC Center, Tokyo Institute of Technology: HP ProLiant SL390s G7, Xeon 6C, NVIDIA GPU, Linux/Windows
5. CINECA/SCS – Supercomputing Solution: iDataPlex DX360M3, Xeon 2.4, NVIDIA GPU, InfiniBand

GPU computing
● GPUs have been the first commodity massively parallel processors.
● For the right kind of code, using GPUs brings huge benefits in terms of performance and energy.
● Development tools have been introduced to ease the programming of GPUs.

● Basic construction node: GPUs inside the box.
[Figure: node with CPU, main memory and network interface, with GPUs and their memories attached through PCI-e inside the node]
● Basic construction node: GPUs outside the box.
[Figure: node where the GPUs and their memories sit in an external enclosure attached through PCI-e]
● From the programming point of view:
  ➔ A set of nodes, each one with:
    – one or more CPUs (with several cores per CPU),
    – one or more GPUs (1-4).
  ➔ An interconnection network.
[Figure: cluster of such nodes connected through an interconnection network]

GPU computing
● Two main approaches in GPU computing development environments:
  ● CUDA → NVIDIA proprietary
  ● OpenCL → open standard
● Basically, OpenCL and CUDA have the same work scheme.
● Compilation: separate
  – CPU code.
  – GPU code (the GPU kernel).
● Running: data transfers between the CPU and GPU memory spaces:
  1. Before GPU kernel execution: data from CPU memory space to GPU memory space.
  2. Computation: kernel execution.
  3. After GPU kernel execution: results from GPU memory space back to CPU memory space.

GPU computing
● What does "the right kind of code" mean?
  ● There must be data parallelism in the code: this is the only way to benefit from the hundreds of processors in a GPU.
  ● The overhead due to data movement between the CPU memory space and the GPU memory space must be limited.
[Figure: influence of data transfers for SGEMM; time devoted to data transfers (%) vs. matrix size (0 to 18000), for pinned and non-pinned memory. From: Matrix computations on graphics processors and clusters of GPUs, Francisco D. Igual Peña, PhD dissertation.]
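The three-step work scheme above can be sketched with a minimal CUDA program. This is an illustration, not code from the talk; the saxpy kernel, names, and sizes are arbitrary choices.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// GPU code (kernel): y = a*x + y, one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // CPU code: allocate and initialize host buffers.
    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);

    // 1. Before kernel execution: CPU memory space -> GPU memory space.
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // 2. Computation: kernel execution.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);

    // 3. After kernel execution: GPU memory space -> CPU memory space.
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f\n", hy[0]);  // 2.0*1.0 + 2.0 = 4.0
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}
```

With pinned host memory, as in the SGEMM figure above, the host buffers would be allocated with cudaMallocHost instead of malloc, which speeds up the PCI-e transfers.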
GPU computing scenarios
● Different scenarios from the point of view of the application:
  ● Low amount of data parallelism.
  ● High amount of data parallelism.
  ● Moderate amount of data parallelism.
  ● Applications for multi-GPU computing.

● Low amount of data parallelism: the application has only a small part where data parallelism can be extracted. BAD for GPU computing: no GPU is needed in the system; just proceed with the traditional HPC strategies.

● High amount of data parallelism: a lot of data parallelism can be extracted from every application. GOOD for GPU computing: add as many GPUs as possible to each node in the system and rewrite the applications in order to use them.

● Moderate amount of data parallelism: the application has a moderate level of data parallelism (≈40%-80%). What about GPU computing? If every node in the system includes GPUs, these GPUs are used only when data parallelism appears in some part of the application. The rest of the time the GPUs are idle, which is an extra cost in both acquisition and maintenance (energy).

● Applications for multi-GPU computing: an application can use a large number of GPUs in parallel. What about GPU computing? The code running in a node can only access the GPUs in that node, but it could run faster if it were possible to access more GPUs.
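The multi-GPU limitation above, that a process sees only the GPUs inside its own node, is visible directly in the CUDA runtime API. A small sketch using standard calls, for illustration only:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    // Reports only the GPUs physically attached to this node.
    cudaGetDeviceCount(&count);
    printf("visible GPUs: %d\n", count);

    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s\n", d, prop.name);
        // Work can be distributed across local devices only.
        cudaSetDevice(d);
    }
    return 0;
}
```

This locality is exactly the gap that a remote-GPU approach such as rCUDA addresses: the same enumeration could then include GPUs hosted by other nodes.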
Introduction to rCUDA
● A tool that enables code running in one node to access GPUs in another node.
● It is useful when you have:
  ● a moderate level of data parallelism;
  ● applications for multi-GPU computing.
[Figure: cluster of nodes, each with CPU, main memory, network interface and PCI-e attached GPUs, connected through an interconnection network; with rCUDA a node can reach the GPUs of other nodes.]
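As a usage sketch: rCUDA is designed to be transparent, so the application keeps its normal CUDA calls and only the environment changes on the client side. The variable names and paths below follow later rCUDA user guides and are assumptions for illustration, not part of this talk:

```shell
# Server node (the one that owns the GPUs): start the rCUDA daemon.
# Client node: expose a remote GPU and point the loader at the rCUDA
# wrapper library that stands in for the CUDA runtime.
export RCUDA_DEVICE_COUNT=1              # number of remote GPUs to expose (assumed name)
export RCUDA_DEVICE_0=gpu-server:0       # hypothetical host:device pair (assumed name)
export LD_LIBRARY_PATH=/opt/rcuda/lib:$LD_LIBRARY_PATH   # assumed install path
./my_cuda_app                            # unmodified CUDA binary
```

The point is that the binary itself is unchanged; the rCUDA client library forwards the CUDA calls over TCP/IP or InfiniBand to the server node that owns the GPU.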