GPU Computing with MATLAB
Loren Dean, Director of Engineering, MATLAB Products, MathWorks
1 Spectrogram shows 50x speedup in a GPU cluster
50x
2 Agenda
. Background
. Leveraging the desktop
  – Basic GPU capabilities
  – Multiple GPUs on a single machine
. Moving to the cluster
  – Multiple GPUs on multiple machines
. Q&A
3 How many people are using…
. MATLAB
. MATLAB with GPUs
. Parallel Computing Toolbox
  – R2010b prerelease
. MATLAB Distributed Computing Server
4 Why GPUs and why now?
. Operations are IEEE compliant
. Cross-platform support now available
. Single/double performance in line with expectations
5 Parallel Computing with MATLAB Tools and Terminology
(Diagram: Parallel Computing Toolbox™ on the desktop computer connects through a scheduler to MATLAB Distributed Computing Server™ running on a computer cluster.)
6 Parallel Capabilities
(Diagram: capabilities arranged on an axis from Ease of Use to Greater Control.)

                  Task Parallel   Data Parallel                  Environment
Ease of Use       parfor          distributed array,             matlabpool, Configurations,
                  batch           >200 functions                 Local workers
Greater Control   job/task        spmd, co-distributed array,    MathWorks job manager,
                                  job/task, MPI interface        third-party schedulers

Built-in support with Simulink, toolboxes, and blocksets
7 Evolving With Technology Changes
Single processor → Multiprocessor → Multicore → Cluster → Grid, Cloud → GPU
8 What’s new in R2010b?
. Parallel Computing Toolbox
  – GPU support
  – Broader algorithm support (QR, rectangular \)
. MATLAB Distributed Computing Server
  – GPU support
  – Run as user with MathWorks job manager
  – Non-shared file system support
. Simulink®
  – Real-Time Workshop® support with PCT and MDCS
9 GPU Functionality
. Call GPU(s) from MATLAB or toolbox/server worker
. Support for CUDA 1.3 enabled devices
. GPU array data type
  – Store arrays in GPU device memory
  – Algorithm support for over 100 functions
  – Integer and double support
. GPU functions
  – Invoke element-wise MATLAB functions on the GPU
. CUDA kernel interface
  – Invoke CUDA kernels directly from MATLAB
  – No MEX programming necessary
10 Demo hardware
11 Example: GPU Arrays
>> A = someArray(1000, 1000);
>> G = gpuArray(A);   % Push to GPU memory
…
>> F = fft(G);
>> x = G\b;
…
>> z = gather(x);     % Bring back into MATLAB
12 GPUArray Function Support
. >100 functions supported
  – fft, fft2, ifft, ifft2
  – Matrix multiplication (A*B)
  – Matrix left division (A\b)
  – LU factorization
  – Transpose (' and .')
  – abs, acos, …, minus, …, plus, …, sin, …
. Key functions not supported
  – conv, conv2, filter
  – indexing
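Because conv is not GPU-accelerated in this release, one hedged workaround is FFT-based convolution built from the supported fft/ifft functions. This is a sketch, not from the slides; it assumes Parallel Computing Toolbox with a CUDA 1.3 device, and that the padded fft(x, N) form is supported on gpuArray inputs (otherwise, pad explicitly before the transform).

```matlab
% Sketch: linear convolution on the GPU via the convolution theorem,
% as a substitute for the unsupported conv function.
a = gpuArray(rand(1, 4096));        % example signal (illustrative sizes)
b = gpuArray(rand(1, 128));         % example filter
n = numel(a) + numel(b) - 1;        % length of the full linear convolution
N = 2^nextpow2(n);                  % zero-pad to a power of two for the FFT
c = ifft(fft(a, N) .* fft(b, N));   % pointwise multiply in the frequency domain
c = real(c(1:n));                   % trim padding; inputs here are real
c = gather(c);                      % bring the result back to the CPU
```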
13 GPU Array benchmarks
A\b*     Tesla C1060   Tesla C2050 (Fermi)   Quad-core Intel CPU   Ratio (Fermi:CPU)
Single   191           250                   48                    5:1
Double   63.1          128                   25                    5:1
Ratio    3:1           2:1                   2:1

* Results in Gflops, matrix size 8192x8192. Limited by card memory. Computational capabilities not saturated.

14 GPU Array benchmarks
MTIMES   Tesla C1060   Tesla C2050 (Fermi)   Quad-core Intel CPU   Ratio (Fermi:CPU)
Single   365           409                   59                    7:1
Double   75            175                   29                    6:1
Ratio    4.8:1         2.3:1                 2:1
FFT      Tesla C1060   Tesla C2050 (Fermi)   Quad-core Intel CPU   Ratio (Fermi:CPU)
Single   50            99                    2.29                  43:1
Double   22.5          44                    1.47                  30:1
Ratio    2.2:1         2.2:1                 1.5:1
15 Example: arrayfun for Element-Wise Operations
>> y = arrayfun(@foo, x); % Execute on GPU
function y = foo(x)
y = 1 + x.*(1 + x.*(1 + x.*(1 + ...
    x.*(1 + x.*(1 + x.*(1 + x.*(1 + ...
    x.*(1 + x./9)./8)./7)./6)./5)./4)./3)./2);
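A hedged usage sketch for the foo function above. It assumes foo.m is on the path, Parallel Computing Toolbox is installed, and a supported GPU is present; the input size is illustrative.

```matlab
% Sketch: run the same element-wise function on GPU and CPU and compare.
xCpu = rand(1e6, 1, 'single');
xGpu = gpuArray(xCpu);                          % push the input to GPU memory
tic; yGpu = arrayfun(@foo, xGpu); tGpu = toc;   % executed element-wise on the GPU
tic; yCpu = arrayfun(@foo, xCpu); tCpu = toc;   % same computation on the CPU
err = max(abs(gather(yGpu) - yCpu));            % expect rounding-level differences
```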
16 Some arrayfun benchmarks
Note: Due to memory constraints, a different approach is used at N=16 and above.
CPU[4] = multithreading enabled
CPU[1] = multithreading disabled

17 Example: Invoking CUDA Kernels
% Setup
kern = parallel.gpu.CUDAKernel('myKern.ptx', cFcnSig);
% Configure
kern.ThreadBlockSize = [512 1];
kern.GridSize = [1024 1024];
% Run
[c, d] = feval(kern, a, b);
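A hypothetical end-to-end sketch of the steps above. The kernel name, C signature, data, and launch sizes are illustrative assumptions, not from the slides; it assumes the kernel was compiled with `nvcc -ptx myKern.cu`.

```matlab
% Sketch: set up, configure, and launch an assumed vector-add kernel.
cFcnSig = 'void addVectors(float *c, const float *a, const float *b, int n)';
kern = parallel.gpu.CUDAKernel('myKern.ptx', cFcnSig);

n = 4096;
kern.ThreadBlockSize = [256 1];        % threads per block (illustrative)
kern.GridSize = [ceil(n/256) 1];       % enough blocks to cover n elements

a = single(rand(n, 1));
b = single(rand(n, 1));
% Non-const pointer arguments become outputs; plain MATLAB arrays passed
% in are transferred to the device automatically.
c = feval(kern, zeros(n, 1, 'single'), a, b, n);
c = gather(c);                         % result is a gpuArray; gather to CPU
```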
18 Options for scaling up
. Leverage matlabpool
  – Enables desktop or cluster cleanly
  – Can be done either interactively or in batch
. Decide how to manage your data
  – Task parallel
    . Use parfor
    . Same operation with different inputs
    . No interdependencies between operations
  – Data parallel
    . Use spmd
    . Allows for interdependency between operations at the CPU level
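A hedged sketch of the two styles above. It assumes an open matlabpool (e.g. `matlabpool open 4`); the doWork helper is hypothetical.

```matlab
% Task parallel: independent iterations distributed with parfor.
results = zeros(1, 8);
parfor i = 1:8
    results(i) = doWork(i);   % hypothetical helper; no iteration depends on another
end

% Data parallel: cooperating workers with spmd.
spmd
    part = labindex;          % each worker contributes its own piece
    total = gplus(part);      % workers exchange data to form a global sum
end
```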
19 MATLAB Pool Extends Desktop MATLAB
(Diagram: desktop MATLAB with toolboxes and blocksets, extended by a pool of workers.)
20 Example: Spectrogram on the desktop (CPU only)
D = data;
iterations = 2000;              % # of parallel iterations
stride = iterations*step;       % stride of outer loop

M = ceil((numel(x)-W)/stride);  % iterations needed
o = cell(M, 1);                 % preallocate output
for i = 1:M
    % What are the start points
    thisSP = (i-1)*stride:step: ...
        (min(numel(x)-W, i*stride)-1);

    % Move the data efficiently into a matrix
    X = copyAndWindowInput(D, window, thisSP);

    % Take lots of fft's down the columns
    X = abs(fft(X));

    % Return only the first part to MATLAB
    o{i} = X(1:E, 1:ratio:end);
end
21 Example: Spectrogram on the desktop (CPU to GPU)
(Two-column slide: CPU code on the left, GPU code on the right. The GPU version, with the changed lines marked:)

D = gpuArray(data);             % changed: push the data to the GPU
iterations = 2000;              % # of parallel iterations
stride = iterations*step;       % stride of outer loop

M = ceil((numel(x)-W)/stride);  % iterations needed
o = cell(M, 1);                 % preallocate output
for i = 1:M
    % What are the start points
    thisSP = (i-1)*stride:step: ...
        (min(numel(D)-W, i*stride)-1);   % changed: numel(D)

    % Move the data efficiently into a matrix
    X = copyAndWindowInput(D, window, thisSP);

    % Take lots of fft's down the columns
    X = gather(abs(fft(X)));    % changed: gather the result from the GPU

    % Return only the first part to MATLAB
    o{i} = X(1:E, 1:ratio:end);
end
22 Example: Spectrogram on the desktop (GPU to parallel GPU)
(Two-column slide: single-GPU code on the left, parallel version on the right. The parallel version, with the changed line marked:)

D = gpuArray(data);
iterations = 2000;              % # of parallel iterations
stride = iterations*step;       % stride of outer loop

M = ceil((numel(x)-W)/stride);  % iterations needed
o = cell(M, 1);                 % preallocate output
parfor i = 1:M                  % changed: parfor distributes the iterations
    % What are the start points
    thisSP = (i-1)*stride:step: ...
        (min(numel(D)-W, i*stride)-1);

    % Move the data efficiently into a matrix
    X = copyAndWindowInput(D, window, thisSP);

    % Take lots of fft's down the columns
    X = gather(abs(fft(X)));

    % Return only the first part to MATLAB
    o{i} = X(1:E, 1:ratio:end);
end
23 Spectrogram shows 50x speedup in a GPU cluster
50x
24 But… speedup is 5x with data transfer (remember we have to transfer data over GigE and then across the PCIe bus)
5x
25 GPU can go well beyond the CPU
(Chart annotations: GPU ~6 seconds (65x) and ~16 seconds vs. CPU ~10-15 seconds.)
26 Summary of Options for Targeting GPUs
Across one or more GPUs on one or more machines:
1) Use GPU array interface with MATLAB built-in functions
2) Execute custom functions on elements of the GPU array
3) Create kernels from existing CUDA code and PTX files

(Diagram axis: the options range from Ease of Use to Greater Control.)
28 What hardware is supported?
. NVIDIA hardware meeting the CUDA 1.3 hardware spec
. A listing can be found at: http://www.nvidia.com/object/cuda_gpus.html
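The check above can be made from MATLAB itself. A hedged sketch, assuming Parallel Computing Toolbox is installed and a CUDA-capable device is present:

```matlab
% Sketch: query the selected GPU and test it against the CUDA 1.3 requirement.
d = gpuDevice;   % properties include Name and ComputeCapability (a string)
fprintf('%s, compute capability %s\n', d.Name, d.ComputeCapability);
if str2double(d.ComputeCapability) >= 1.3
    disp('Device meets the CUDA 1.3 spec.');
else
    disp('Device does not meet the CUDA 1.3 spec.');
end
```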
29 How come function_xyz is not GPU-accelerated?
. The accelerated functions available in this first release were gated by available resources.
. We will add capabilities in coming releases based on requirements and feedback.
30 Why did we adopt CUDA and not OpenCL?
. CUDA has the only ecosystem with all of the libraries necessary for technical computing.
31 Why are CUDA 1.1 and CUDA 1.2 not supported?
As mentioned earlier, CUDA 1.3 offers the following capabilities that earlier releases of CUDA do not:
  – Support for doubles. The base data type in MATLAB is double.
  – IEEE compliance. We want to ensure we get the correct answer.
  – Cross-platform support.
32 What benchmarks are available?
. Some benchmarks are available in the product and at www.mathworks.com/products/parallel-computing/
. More will be added over time