An approach to providing remote access to GPU computational power

Rafael Mayo Gual, Universitat Jaume I, Spain
2011 HPC Advisory Council China Workshop

Joint research effort

Outline

● GPU computing
● GPU computing scenarios
● Introduction to rCUDA
● rCUDA structure
● rCUDA functionality
● Basic TCP/IP version
● InfiniBand version
● Work in progress and near future


GPU computing

● GPU computing covers all the technology (hardware and software) involved in using the computational power of a GPU to execute general-purpose code.

● This leads to a heterogeneous system.

● GPU computing has grown enormously in recent years.


● November 2008, Top500 list: the first supercomputer using GPU computing enters the Top500 (#29), at the Tokyo Institute of Technology.


Top500, June 2011:

Rank | Site | Computer | Vendor
1 | RIKEN Advanced Institute for Computational Science (AICS), Japan | K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect, 2011 | Fujitsu
2 | National Supercomputing Center in Tianjin, China | Tianhe-1A, NUDT TH MPP, X5670 2.93 GHz 6C, nVidia GPU, FT-1000 8C, 2010 | NUDT
3 | DOE/SC/Oak Ridge National Laboratory, United States | Jaguar, Cray XT5-HE, Opteron 6-core 2.6 GHz, 2009 | Cray Inc.
4 | National Supercomputing Centre in Shenzhen (NSCS), China | Nebulae, Dawning TC3600 Blade, Intel X5650, nVidia Tesla C2050 GPU, 2010 | Dawning
5 | GSIC Center, Tokyo Institute of Technology, Japan | TSUBAME 2.0, HP ProLiant SL390s G7, Xeon 6C X5670, nVidia GPU, Linux/Windows, 2010 | NEC/HP


Green500, June 2011:

Rank | Site | Computer
1 | IBM Thomas J. Watson Research Center | NNSA/SC Blue Gene/Q Prototype 2
2 | IBM Thomas J. Watson Research Center | NNSA/SC Blue Gene/Q Prototype 1
3 | Nagasaki University | DEGIMA Cluster, Intel i5, ATI Radeon GPU, InfiniBand QDR
4 | GSIC Center, Tokyo Institute of Technology | HP ProLiant SL390s G7 Xeon 6C, nVidia GPU, Linux/Windows
5 | CINECA/SCS Supercomputing Solution | iDataPlex DX360M3, Xeon 2.4, nVidia GPU, InfiniBand



● GPUs have been the first commodity massively parallel processors.

● For the right kind of code, GPUs bring huge benefits in terms of performance and energy.

● Development tools have been introduced to ease GPU programming.


● Basic building block (node): ➔ GPUs inside the CPU box.

[Figure: node with CPU and main memory; the GPUs, each with its own GPU memory, and the network interface attach through PCI-e.]


● Basic building block (node): ➔ GPUs outside the CPU box.

[Figure: node with CPU and main memory; an external PCI-e enclosure holds several GPUs, each with its own GPU memory, alongside the network interface.]


● From the programming point of view:
➔ a set of nodes, each one with:
  ➔ one or more CPUs (several cores per CPU),
  ➔ one or more GPUs (1-4);
➔ an interconnection network.
(A device-enumeration sketch follows the figure below.)

[Figure: cluster diagram. Several nodes, each with CPUs, main memory, PCI-e attached GPUs (each with its own GPU memory), and a network interface, joined by an interconnection network.]


● Two main approaches in GPU computing development environments:
● CUDA → nVidia proprietary
● OpenCL → open standard


● Basically, OpenCL and CUDA share the same working scheme. Compilation is separate for:
– CPU code.
– GPU code (the GPU kernel).


● Basically, OpenCL and CUDA share the same working scheme. Running involves data transfers between the CPU and GPU memory spaces:
1. Before GPU kernel execution: data moves from CPU memory space to GPU memory space.
2. Computation: kernel execution.
3. After GPU kernel execution: results move from GPU memory space back to CPU memory space.
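A minimal CUDA sketch of the three steps (a data-parallel vector scaling; error checking omitted for brevity):

    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *v, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
        if (i < n) v[i] *= a;
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes), *d;
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        cudaMalloc((void **)&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* 1. CPU -> GPU */
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);      /* 2. kernel     */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  /* 3. GPU -> CPU */

        cudaFree(d);
        free(h);
        return 0;
    }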


● What does "the right kind of code" mean?

● There must be data parallelism in the code: this is the only way to benefit from the hundreds of cores in a GPU.


● What does "the right kind of code" mean?

● The overhead due to data movement between the CPU memory space and the GPU memory space must be limited.
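A simple way to check this property is to time the transfers and the kernel separately. A sketch using CUDA events to time a host-to-device copy:

    #include <cuda_runtime.h>

    /* Returns the elapsed milliseconds of a host-to-device copy. */
    static float time_h2d(void *dst, const void *src, size_t bytes) {
        cudaEvent_t start, stop;
        float ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

As the next figure shows, whether the host buffer is pinned (page-locked) or not changes the transfer fraction considerably.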


● What does "the right kind of code" mean?

[Figure: influence of data transfers for SGEMM. Percentage of total time devoted to data transfers versus matrix size (up to 18000), for pinned and non-pinned host memory. From: Matrix computations on graphics processors and clusters of GPUs, Francisco D. Igual Peña, PhD dissertation.]


GPU computing scenarios

Different scenarios from the point of view of the application:
● Low amount of data parallelism.
● High amount of data parallelism.
● Moderate amount of data parallelism.
● Applications for multi-GPU computing.


● Low amount of data parallelism: the application has only a small part from which data parallelism can be extracted.

BAD for GPU computing. No GPU is needed in the system; just proceed with the traditional HPC strategies.


● High amount of data parallelism: a lot of data parallelism can be extracted from the application.

GOOD for GPU computing. Add as many GPUs as possible to each node in the system and rewrite the applications to use them.


● Moderate amount of data parallelism: the application has a moderate level of data parallelism (≈40%-80%).

What about GPU computing? If every node in the system includes GPUs, these GPUs are used only when data parallelism appears in some part of the application. The rest of the time the GPUs are idle, an extra cost in both acquisition and maintenance (energy).


● Applications for multi-GPU computing: an application can use a large number of GPUs in parallel.

What about GPU computing? The code running in a node can only access the GPUs in that node, but it could run faster if it were able to access more GPUs.


Introduction to rCUDA

A tool that enables code running on one node to access GPUs located in other nodes.

It is useful when you have:
● a moderate level of data parallelism;
● applications for multi-GPU computing.


● Moderate level of data parallelism

[Figure: cluster in which every node has its own GPU, all joined by an interconnection network.]

Adding a set of GPUs to each node leads to GPUs sitting idle for long periods. This is a waste of both money and energy.


● Moderate level of data parallelism

[Figure: the same cluster with GPUs installed in only some of the nodes.]

Add only as many GPUs as can actually be used, considering the applications and their amount of data parallelism...


● Moderate level of data parallelism

[Figure: the same cluster, with logical interconnections that make the GPUs of the GPU-equipped nodes accessible from every node.]

Add only as many GPUs as can actually be used, considering the applications and their amount of data parallelism, and make all of them accessible from every node.


● Applications for multi-GPU computing

[Figure: cluster in which every node has its own GPU, joined by an interconnection network.]

From each CPU it is only possible to access the GPUs of its own node.


● Applications for multi-GPU computing

[Figure: the same cluster, with logical interconnections that make every GPU accessible from every node.]

Make all GPUs accessible from every node.


● Applications for multi-GPU computing

[Figure: the same cluster; the logical interconnections let a single CPU use as many GPUs as it needs.]

Make all GPUs accessible from every node, and enable a CPU to access as many GPUs as necessary.


rCUDA structure

[Figure: the standard CUDA stack. The application runs on top of the CUDA driver + runtime, which drives the local GPU.]


[Figure: client side and server side, each shown with an application on top of its own CUDA driver + runtime.]


[Figure: rCUDA structure. Client side: the CUDA application runs on top of the rCUDA library, which talks to the network device. Server side: the rCUDA daemon receives requests from its network device and executes them through the CUDA driver + runtime on the local GPU.]
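Internally, the rCUDA library exposes the CUDA runtime API and forwards each call to the daemon as a request message. A minimal sketch of what such a request header might look like (illustrative names and layout, not the actual rCUDA wire format):

    #include <stdint.h>

    /* Hypothetical request header, one per forwarded CUDA call. */
    typedef struct {
        uint32_t call_id;       /* which CUDA runtime call is requested     */
        uint32_t flags;         /* e.g. direction of the attached payload   */
        uint64_t payload_size;  /* bytes of argument/payload data following */
    } rpc_header_t;

    enum {                      /* a few forwarded calls, as an example */
        CALL_MALLOC = 1,        /* cudaMalloc                           */
        CALL_MEMCPY_H2D,        /* cudaMemcpy, host to device           */
        CALL_MEMCPY_D2H,        /* cudaMemcpy, device to host           */
        CALL_LAUNCH,            /* kernel launch                        */
        CALL_FREE               /* cudaFree                             */
    };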


rCUDA functionality

CUDA programming: C extensions + runtime library.

C extensions: not supported in the current version of rCUDA. We are working on it.

Runtime library: support for almost all functions. For some internal functions nVidia does not give information, so they are not supported in rCUDA.
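Since the C extensions (the kernel<<<...>>>() launch syntax) are not supported, an application can fall back on the explicit launch functions that the CUDA 4.0-era runtime provides. A sketch (argument offsets must respect each argument's alignment; for C++ kernels the mangled symbol name would be needed):

    /* Equivalent of: scale<<<grid, block>>>(d_v, a, n); */
    dim3 grid((n + 255) / 256), block(256);

    cudaConfigureCall(grid, block, 0, 0);
    cudaSetupArgument(&d_v, sizeof(d_v), 0);
    cudaSetupArgument(&a,   sizeof(a),   sizeof(d_v));
    cudaSetupArgument(&n,   sizeof(n),   sizeof(d_v) + sizeof(a));
    cudaLaunch("scale");    /* entry point identified by name */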


Supported CUDA 4.0 Runtime Functions

Module | Functions | Supported
Device management | 13 | 13
Error handling | 3 | 3
Event management | 7 | 7
Execution control | 7 | 7
Memory management | 47 | 47
Peer device memory access | 3 | 3
Stream management | 5 | 5
Surface reference management | 2 | 2
Texture reference management | 8 | 8
Thread management | 6 | 6
Unified addressing | 1 | 1
Version management | 2 | 2


NOT YET Supported CUDA 4.0 Runtime Functions

Module | Functions | Supported
OpenGL interoperability | 4 | 0
Direct3D 9 interoperability | 5 | 0
Direct3D 10 interoperability | 5 | 0
Direct3D 11 interoperability | 5 | 0
VDPAU interoperability | 4 | 0
Graphics interoperability | 6 | 0


Supported CUBLAS Functions

Module | Functions | Supported
Helper function reference | 15 | 15
BLAS-1 | 54 | 13
BLAS-2 | 66 | 16
BLAS-3 | 30 | 9


Basic TCP/IP version

● Proof of concept.

● Uses the TCP/IP stack.

● A basic version to show the functionality.

● Gives an estimation of the overhead due to the communication network.

● Runs over any TCP/IP network: Ethernet, InfiniBand, etc.


Example of rCUDA interaction ➔ Initialization

[Figure: initialization sequence between client application and server daemon. The client sends a "get GPU" request; the server queries its GPUs and returns the result. The client then locates and sends the kernel; the server loads the kernel and returns the result.]


Example of rCUDA interaction ➔ cudaMemcpy(..., cudaMemcpyHostToDevice)

[Figure: cudaMemcpy sequence between client application and server daemon. The client sends the data to the server; the daemon copies it into GPU memory and returns the result of the operation.]
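A sketch of what the client-side stub for this call might look like over a plain TCP socket (illustrative protocol, not the actual rCUDA implementation; sock is assumed to be already connected to the daemon):

    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    enum { CALL_MEMCPY_H2D = 2 };

    typedef struct {            /* hypothetical wire format */
        uint32_t call_id;       /* CALL_MEMCPY_H2D                      */
        uint64_t dev_dst;       /* device pointer on the remote GPU     */
        uint64_t size;          /* payload bytes that follow the header */
    } memcpy_req_t;

    /* Client stub: forward a host-to-device copy to the rCUDA daemon. */
    int remote_memcpy_h2d(int sock, uint64_t dev_dst,
                          const void *src, uint64_t size) {
        memcpy_req_t req = { CALL_MEMCPY_H2D, dev_dst, size };
        int32_t status;

        if (send(sock, &req, sizeof(req), 0) != (ssize_t)sizeof(req)) return -1;
        if (send(sock, src, size, 0) != (ssize_t)size)                return -1;
        /* The daemon performs the real cudaMemcpy and returns its status. */
        if (recv(sock, &status, sizeof(status), MSG_WAITALL) != sizeof(status))
            return -1;
        return status;
    }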


● Main problem: data movement overhead.
● In CUDA this overhead is due to:
  ● PCIe data transfers.
● In rCUDA this overhead is due to:
  ● PCIe data transfers;
  ● network data transfers.


Data transfer time for matrix-matrix multiplication (GEMM): 2 data matrices from client to remote GPU, 1 result matrix from remote GPU to client.

[Figure: rCUDA vs CUDA data transfer time (ms) over 10 Gb Ethernet, as a function of matrix dimension.]


Execution time for matrix-matrix multiplication (GEMM)

[Figure: GEMM execution time versus matrix dimension. Testbed: Tesla C1060, Intel Xeon E5410 2.33 GHz, 10 Gb Ethernet. rCUDA time is broken down into kernel execution, data transfers, and rCUDA misc operations, and compared against CPU execution.]


Estimated execution time for matrix multiplication, including data transfers, for some HPC networks

[Figure: estimated GEMM execution time versus matrix dimension for CPU, 10 Gb Ethernet, 40 Gb InfiniBand, and local CUDA.]


● The functionality has been shown: almost all CUDA SDK examples have been tested.

● If the network overhead can be minimized, a remote rCUDA device can deliver performance close to that of a local CUDA device.


InfiniBand version

● Why an InfiniBand version?
● InfiniBand is the most used HPC network:
– low latency and high bandwidth.

[Figure: Top500 June 2011 interconnect share: InfiniBand (including QDR, DDR, and DDR 4x), Gigabit Ethernet, proprietary, custom, and other interconnects.]


● Why an InfiniBand version?
● InfiniBand is the most used HPC network:
– low latency and high bandwidth.
● Good results are expected.

[Figure: estimated SGEMM execution time versus matrix dimension over 40 Gb InfiniBand, compared with local CUDA.]


● InfiniBand version facts:

● Uses IB Verbs – the whole TCP/IP stack overhead is avoided.

● Our goal is to get as close as possible to the network peak performance – a bandwidth test of our IB network gives about 2900 MB/s.


● Same user-level functionality.
● Client to/from remote GPU bandwidth near the peak InfiniBand network bandwidth.
● Use of GPUDirect – reduces the number of intra-node data movements.
● Use of pipelined transfers – overlaps intra-node data movements and network transfers.


Intra-node data movement: basic method

Two different main memory zones are needed: one associated with the GPU and one with the network card.

[Figure: node with main CPU memory, InfiniBand adapter, chipset, and GPU (processor + memory) on PCI-e. The same diagram applies to the steps below.]


Intra-node data movement: basic method

Step 1: copy data from GPU memory to the main memory buffer associated with the GPU.



Intra-node data movement: basic method

Step 2: copy data between the two main memory buffers.



Intra-node data movement: basic method

Step 3: send data from the main memory buffer associated with the network card.



Intra-node data movement: basic method

In total, three data movements are needed; a sketch of this path follows.
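A sketch of the basic method for sending data from the GPU to the remote client, with a hypothetical helper ib_send() standing in for the IB Verbs send path (the helper and the two host buffers are illustrative):

    #include <string.h>
    #include <cuda_runtime.h>

    /* Hypothetical helper: sends 'size' bytes from a buffer already
       registered with the InfiniBand adapter. */
    int ib_send(const void *buf, size_t size);

    /* Basic method: three data movements per transfer. */
    int send_from_gpu_basic(const void *d_src, size_t size,
                            void *gpu_buf,  /* buffer used with the GPU */
                            void *net_buf)  /* buffer registered with the NIC */
    {
        /* 1. GPU memory -> main memory buffer associated with the GPU */
        cudaMemcpy(gpu_buf, d_src, size, cudaMemcpyDeviceToHost);
        /* 2. copy between the two main memory buffers */
        memcpy(net_buf, gpu_buf, size);
        /* 3. send from the buffer associated with the network card */
        return ib_send(net_buf, size);
    }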



Intra-node data movement: GPUDirect

Only ONE main memory zone is needed. This zone is bound to both the GPU and the network device



Intra-node data movement: GPUDirect

Step 1 Copy data from the GPU memory to the main memory.



Intra-node data movement: GPUDirect

Step 2 Send data from the main memory.



Intra-node data movement: GPUDirect. Only TWO data movements are needed; a sketch of the shared-buffer setup follows.
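A sketch of how the single shared memory zone can be set up: a page-locked buffer allocated with the CUDA runtime and then registered with the InfiniBand adapter through the standard IB Verbs call ibv_reg_mr() (pd is an existing protection domain; the surrounding connection setup is omitted):

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    /* One host buffer, visible to both the CUDA runtime and the HCA. */
    void *setup_shared_buffer(struct ibv_pd *pd, size_t size,
                              struct ibv_mr **mr_out)
    {
        void *buf;

        /* Page-locked allocation, usable for fast GPU DMA copies. */
        if (cudaHostAlloc(&buf, size, cudaHostAllocPortable) != cudaSuccess)
            return NULL;

        /* Register the SAME buffer with the InfiniBand adapter. */
        *mr_out = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE);
        if (*mr_out == NULL) {
            cudaFreeHost(buf);
            return NULL;
        }
        return buf;  /* one cudaMemcpy into buf, then one send from it */
    }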



Standard data transfers between nodes

[Figure: client node and server node, each with main memory, CPU, chipset, and InfiniBand adapter; the server also hosts the GPU on PCI-e. The timelines below track activity on the client, the network, and the server.]

Step by step: the client first copies the whole buffer into its network buffers; the network then sends it; finally, the server daemon copies the received data to the GPU. Each stage begins only after the previous one has completely finished.


Pipelined data transfers

The transfer is split into chunks. While the client copies one chunk into its network buffers, the network is sending the previous chunk, and the server is copying the chunk before that one to the GPU. Client copies, network sends, and server copies to the GPU all overlap, so the overhead of moving data to the remote node shrinks to little more than the fill and drain of the pipeline.
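A sketch of the server-side receive path under the same idea: two pinned staging buffers, a hypothetical helper ib_recv_chunk() standing in for the IB Verbs receive path, and cudaMemcpyAsync() so that the copy of one chunk to the GPU overlaps the reception of the next:

    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)  /* pipeline chunk size, a tuning parameter */

    /* Hypothetical helper: receives up to 'max' bytes into a registered
       buffer and returns the number of bytes actually received. */
    size_t ib_recv_chunk(void *buf, size_t max);

    /* Receive 'total' bytes and copy them to device memory, pipelined. */
    void recv_to_gpu_pipelined(char *d_dst, size_t total,
                               void *stage[2] /* two pinned buffers */)
    {
        cudaStream_t s;
        cudaStreamCreate(&s);

        size_t done = 0;
        int cur = 0;
        while (done < total) {
            size_t n = ib_recv_chunk(stage[cur], CHUNK); /* overlaps the  */
                                                         /* previous copy */
            cudaStreamSynchronize(s);    /* staging buffer is free again  */
            cudaMemcpyAsync(d_dst + done, stage[cur], n,
                            cudaMemcpyHostToDevice, s);
            done += n;
            cur ^= 1;                    /* swap staging buffers          */
        }
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }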


Bandwidth for a 4096 × 4096 single-precision matrix

[Figure: attained bandwidth (MB/s) over 40 Gb InfiniBand for rCUDA over Gigabit Ethernet, rCUDA over IPoIB, rCUDA over IB Verbs, and local CUDA. The IB peak bandwidth of the testbed, about 2900 MB/s, is marked.]


Execution time for a matrix-matrix multiplication (dim = 4096)

[Figure: execution time (s) for rCUDA over Gigabit Ethernet, rCUDA over IPoIB, rCUDA over IB Verbs, local CUDA, and CPU (MKL), on a GeForce 9800 GTX and an Intel Xeon E5645; measured times span 2.28 s down to 0.62 s.]


Work in progress

● Dynamic remote GPU scheduling
● Port to Microsoft Windows
● Full support for CUDA 4.0
● Support for the C/C++ extensions
● Apply the approach to OpenCL

Near future

● Support for iWARP communications
● Workload balance
● Remote GPU data
● Remote GPU kernel cache

More information

http://www.gap.upv.es/rCUDA http://www.hpca.uji.es/rCUDA

● GPU virtualization in high performance clusters. J. Duato, F. Igual, R. Mayo, A. J. Peña, E. S. Quintana, F. Silla. 4th Workshop on Virtualization and High-Performance Cloud Computing, VHPC 2009.

● rCUDA: reducing the number of GPU-based accelerators in high performance clusters. J. Duato, A. J. Peña, F. Silla, R. Mayo, E. S. Quintana. Workshop on Optimization Issues in Energy Efficient Distributed Systems, OPTIM 2010.

● Performance of CUDA virtualized remote GPUs in high performance clusters. J. Duato, R. Mayo, A. J. Peña, E. S. Quintana, F. Silla. International Conference on Parallel Processing, ICPP 2011.

● Enabling CUDA acceleration within virtual machines using rCUDA. J. Duato, A. J. Peña, F. Silla, J. C. Fernández, R. Mayo, E. S. Quintana. High Performance Computing Conference, HiPC 2011.


People: Antonio Peña, Enrique S. Quintana-Ortí, Jose Duato, Rafael Mayo, Federico Silla.

Thanks to Mellanox and AIC for their support of this work.
