ArrayFire Software Library for CUDA & OpenCL

Kyle Spafford, Senior Engineer GeoInt Accelerator • Platform to start developing with GPUs • GPU access, GPU-accelerated apps and libraries • Register to learn more: http://www.nvidia.com/geoint

Webinar Feedback Submit your feedback for a chance to win Tesla K20 GPU* https://www.surveymonkey.com/s/ArrayFire * Offer is valid until October 1st 2013

Overview

1. Introduction to ArrayFire & Case Studies
   • Radar & Signal Processing
   • Image Processing
2. Getting Started With ArrayFire
   • Header Files
   • Data Types
   • N-Dimension Arrays
   • Data Structures
   • Basic Operations
   • Graphics Package
3. Quick Demo of Graphics Library

Introduction to ArrayFire

• A fast software library for GPU computing with an easy-to-use API.
• Array-based function set that makes GPU programming simple.
• Available for C, C++, and Fortran and integrated with AMD, Intel, and NVIDIA hardware.

Introduction to ArrayFire Performance & Programmability

• Super easy to program
• Highly optimized

Introduction to ArrayFire Portability

Introduction to ArrayFire Scalability

• Multi-GPU is 1 line of code

    array *y = new array[n];
    for (int i = 0; i < n; ++i) {
        deviceset(i);             // change GPUs
        array x = randu(5,5);     // add work to GPU's queue
        y[i] = fft(x);            // more work in queue
    }
    // all GPUs are now computing simultaneously

1. Introduction to ArrayFire Community

• Over 8,000 posts at http://forums.accelereyes.com
• Nightly library update releases
• Stable releases a few times a year
• v2.0 coming at the end of summer

1. Introduction to ArrayFire http://accelereyes.com/case_studies

Speedup   Application        Organization
17X       Neuro-imaging      Georgia Tech
20X       Viral Analyses     CDC
20X       Video Processing   Google
45X       Radar Imaging      System Planning
12X       Medical Devices    Spencer Tech
5X        Weather Models     NCAR
35X       Power Eng          IIT India
17X       Surveillance       BAE Systems
70X       Drug Delivery      Georgia Tech
35X       Bioinformatics     Leibnitz

Radar Image Formation

• Worked with System Planning Corp.
• Optimized a SAR/ISAR backprojection-based image formation algorithm
• 45x improvement over legacy code for large datasets

Radar Clutter Reduction

• Also with System Planning Corp.
• Optimized a clutter reduction algorithm based on multiple thresholding techniques
• 10x speedup over Core i7 CPU, 5x over DSP

UAV Imaging – Ground Vehicle Detection

• Project: Ground vehicle detection (IRAD)
• Performance
   • Baseline: 4.0 s (IPP-enhanced CPU code)
   • Goal: 0.4 s (real-time)
• Problem: Optimized CPU code was 10X too slow. The workload was many small image patches, apparently not well suited for GPUs.
• Avoid low-level hassle with raw CUDA
• Completed as a combined services and ArrayFire project
• Result: 0.4 second goal achieved in < 1 month

Introduction to ArrayFire Hundreds of Functions

reductions
• sum, min, max, count, prod
• vectors, columns, rows, etc

signal processing
• convolutions, FFT, FIR, IIR

geometry
• sin, sinh, asin, asinh, cos, cosh, acos, acosh, tan, tanh, atan, atanh, atan2, hypot

graphics
• surface, arrow, plot, image, volume, figure

dense linear algebra
• LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power

math
• sign, sqrt, root, pow, ceil, floor, round, trunc, log, exp, gamma, epsilon, erf, abs, arg, real, etc.

image processing
• filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms

array manipulation
• transpose, conjugate, lower, upper, diag, join, flip, tile, flat, shift, reorder

device management
• info, deviceset, deviceget, devicecount

and many more…

Header Files

• #include <arrayfire.h>
   • header file for all ArrayFire functions
• All AF functions are organized into the af namespace

* By including only “arrayfire.h” and using namespace af, you have all the ArrayFire functions ready to use

Monte Carlo Estimation of Pi

    #include <stdio.h>
    #include <arrayfire.h>
    using namespace af;

    int main() {
        // 20 million random samples
        int n = 20e6;
        array x = randu(n,1), y = randu(n,1);
        // how many fell inside unit circle?
        float pi = 4 * sum<float>(sqrt(mul(x,x)+mul(y,y))<1) / n;
        printf("pi = %g\n", pi);
        return 0;
    }
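For comparison, the same estimator can be written as a host-only loop. This is a minimal sketch in standard C++, not ArrayFire code; the function name `estimate_pi` and the seed parameter are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// CPU reference for the estimator above: draw n points uniformly in the
// unit square and count those that fall inside the quarter circle.
// estimate_pi is a hypothetical helper name, not an ArrayFire function.
double estimate_pi(std::size_t n, unsigned seed) {
    assert(n > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < n; ++i) {
        double x = uniform(gen);
        double y = uniform(gen);
        if (x * x + y * y < 1.0) ++inside;  // equivalent to sqrt(x*x + y*y) < 1
    }
    // The quarter circle covers pi/4 of the unit square.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(n);
}
```

Calling `estimate_pi(1000000, 42u)` should return a value close to 3.14; the array version expresses the same count as a single reduction over all samples.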

Data Types

b8      8-bit boolean byte
f32     real single precision
f64     real double precision
c32     complex single precision
c64     complex double precision
s32     32-bit signed integer
u32     32-bit unsigned integer
array   container object

Generating Arrays

    constant(0, 3)            // 3-by-1 column of zeros, single precision (f32 default)
    constant(1, 3, 2, f64)    // 3-by-2 matrix of ones, double precision
    rand(2, 4, u32)           // random 32-bit integers, every bit is random
    randn(2, 2, s32)          // square matrix of random 32-bit signed integers
    randu(5, 7, c32)          // complex values, both real and imaginary components
    identity(3, 3, c64)       // 3-by-3 identity, complex double precision

* Almost every function that takes a data type parameter defaults to f32

Data Structures Bringing in host data

Example:

    float A[25];                          // A = { 1, 2, 3, ..., 24, 25 }
    for (int i = 0; i < 25; i++) {
        A[i] = i + 1;
    }
    array mat = array(5, 5, A, afHost);   // host values fill the array column by column

    mat = [  1  6 11 16 21 ]
          [  2  7 12 17 22 ]
          [  3  8 13 18 23 ]
          [  4  9 14 19 24 ]
          [  5 10 15 20 25 ]
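ArrayFire arrays are stored in column-major order, so consecutive host values fill one column before moving to the next. The sketch below shows the index arithmetic in standard C++; `at_column_major` and `iota_buffer` are hypothetical helpers written for this illustration and are not part of ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major lookup: element (row, col) of a matrix with `rows` rows
// lives at offset row + col * rows in the flat host buffer.
// at_column_major is a hypothetical helper, not an ArrayFire call.
float at_column_major(const std::vector<float>& data, std::size_t rows,
                      std::size_t row, std::size_t col) {
    assert(rows > 0 && row < rows);
    assert(row + col * rows < data.size());
    return data[row + col * rows];
}

// Build the host buffer 1, 2, ..., n used in the example above.
std::vector<float> iota_buffer(std::size_t n) {
    std::vector<float> a(n);
    for (std::size_t i = 0; i < n; ++i) a[i] = static_cast<float>(i + 1);
    return a;
}
```

With a 25-element buffer read as 5 x 5, `at_column_major(a, 5, 1, 0)` is 2 and `at_column_major(a, 5, 0, 1)` is 6, which is why the second host value appears below the first rather than beside it.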

N-Dimension Support

[Figure: vectors, matrices, volumes]

Related Data Structures

• seq - Index generator (sequential or strided)
• dim4 - Array dimensions descriptor
• timer - Internal timer object
• exception - How AF handles errors

* These are all in namespace “af”

Data Structures Tons of ways to slice and dice arrays

Adjust the dimensions of an N-D array (fast)
• array (const array &input, const dim4 &dims) : Adjust input array using dim4 type
• array (const array &input, int dim0, int dim1=1, int dim2=1, int dim3=1)
• array (const array &) : Duplicate an existing array (copy constructor).
• array (const seq &s) : Convert a seq object for use with arithmetic.

Multiple Subscripts
• array operator() (int x) const : Access linear element x.
• array operator() (int row, int col) const : Access element at row and col.
• array operator() (int x, int y, int z) const : Access element.
• array operator() (const seq &x, int y, int z) const : Access vector.
• array operator() (int x, const seq &y, const seq &z) const : Access matrix.
• array operator() (const seq &w, const seq &x, const seq &y, const seq &z) const : Access volume.

Access row/col to form a vector
• array row (int i) const : Access row i to form row vector.
• array col (int i) const : Access column i to form column vector.
• array slice (int i) const : Access slice i to form a matrix (plane of a volume, channel in an image)

Operator overloading
• +, *, -, /, %, +=, -=, *=, /=, %=, &&, ||, ^, ==, !=, <, <=, >, >=, =, <<
• array as (dtype type) const : Cast array's contents to the given type.
• array T () const : Transpose matrix or vector.
• array H () const : Conjugate transpose (1+2i becomes 1-2i).

Data Structures seq

Sequential index generator
• seq (double n) : Create sequence {0 1 ... n-1}
• seq (double first, double inc, double last) : Create sequence {first first+inc ... last}.

Data Structures array & seq

Example:

    array a(seq(4));            // a = [ 0 1 2 3 ]
    array b(-seq(4));           // b = [ 3 2 1 0 ]
    array c(seq(4)+1);          // c = [ 1 2 3 4 ]
    array d(seq(4)*3);          // d = [ 0 3 6 9 ]
    array e(seq(1, 0.5, 2));    // e = [ 1.0 1.5 2.0 ]
    array f = e.as(u32);        // f = [ 1 1 2 ]
    f = f - 2;                  // f = [ -1 -1 0 ]
    array g = a.copy();         // g = [ 0 1 2 3 ]
    array h = g.T();            // h = [ 0 ]
                                //     [ 1 ]
                                //     [ 2 ]
                                //     [ 3 ]

Data Structures

• int end : Reference last element in dimension.
• seq span : Reference entire dimension.

[Figure: regions selected by A(1,1), A(1,span), A(end,1), A(end,span), and A(span,span,2)]

Data Structures dim4

Array dimensions descriptor
• dim4 (unsigned x, unsigned y=1, unsigned z=1, unsigned w=1) : If only x is specified, constructs column vector.
• dims () const : Access underlying array.
• elements () const : Number of elements (product of dimensions)
• rest (unsigned i) const : Product of dimensions i on.
• numdims () const : Number of dimensions set.

Operator overloading
• [], ==, !=, =

Data Structures dim4 special accessors

Example:

    array a = constant(0,4,5,6);
    a.dims().elements();   // Returns 120 (4x5x6)
    a.dims().rest(1);      // Returns 30 (5x6)
    a.dims().numdims();    // Returns 3

Data Structures timer

A platform-independent timer with microsecond accuracy:
• start() : Starts a timer.
• stop() : Returns elapsed seconds since last start().
• stop (timer start) : Returns elapsed seconds since start.

Accurate and reliable measurement of performance involves several factors:
• Executing enough iterations to achieve peak performance.
• Executing enough repetitions to amortize any overhead from system timers.

Data Structures timer vs. timeit

double af::timeit (void(*fn)()) : Robust timing of a function (either CPU or GPU).

• Basic timing is sometimes inaccurate because the GPU takes some time for initialization and loading libraries on the first run.
• timeit() solves this problem by performing multiple function invocations for a robust timing result.
• Useful method for benchmarking GPU or CPU functions

Data Structures timeit

    static void test_function() {
        // What you want to time…
    }

    int main(int argc, char **argv) {
        double elapsed_time = timeit(test_function);
        printf("test_function took %.6g seconds\n", elapsed_time);
    }
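The idea behind repeated-invocation timing can be sketched in standard C++. The slides do not describe how timeit() chooses its repetitions, so the policy below (one warm-up call, then the best of a fixed number of runs) is an assumption, and `time_best_of` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Warm up once, then report the fastest of `reps` timed runs in seconds.
// This only illustrates the motivation for timeit(); the warm-up and
// best-of policy are assumptions, not ArrayFire's implementation.
double time_best_of(void (*fn)(), int reps) {
    assert(fn != nullptr && reps > 0);
    fn();  // warm-up run absorbs one-time initialization cost
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < reps; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

This host-side sketch measures only synchronous work. Timing asynchronous GPU code also requires waiting for the device to finish before reading the clock, which the sketch does not attempt.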

Data Structures C++ Exceptions

Exception for ArrayFire
• exception ()
• what () const throw () : returns the exception as string (char *)
• &operator<< (std::ostream &s, const exception &e) : prints the error message

Example:

    try {
        ...
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }

Basic Operations

• Generate and Fill Arrays
• Array Manipulation
• Arithmetic Functions
• Device Management
• Parallelized Loops: gfor

Basic Operations Generate and fill arrays

array constant (float value, unsigned d0 ~ d3, dtype ty=f32)
• Create array d0 x d1 x d2 x d3 filled with value
array identity (unsigned d0 ~ d3, dtype ty=f32)
• Create identity matrix array d0 x d1 x d2 x d3
array randu (unsigned d0 ~ d3, dtype ty=f32)
• Create array d0 x d1 x d2 x d3 with uniformly distributed random numbers between 0 and 1
array randn (unsigned d0 ~ d3, dtype ty=f32)
• Create array d0 x d1 x d2 x d3 with normally distributed random numbers, N(0,1)
array rand (unsigned d0 ~ d3, dtype ty=u32)
• Create array d0 x d1 x d2 x d3 of random integers in which every bit is randomly 0 or 1

Basic Operations Generate and fill arrays

Example:

Create a 3x3 matrix of zeros:

    array a = constant(0, 3, 3);

    a = [ 0 0 0 ]
        [ 0 0 0 ]
        [ 0 0 0 ]

Create a 3x3 matrix of random numbers between 0 and 1:

    array b = randu(3, 3);

    b = [ 0.2920 0.1541 0.6110 ]
        [ 0.3194 0.4452 0.3073 ]
        [ 0.8109 0.2080 0.4156 ]

Example:

    array Y = randn(3, 3);   // Y ~ N(0,1)

    Y = [  0.2925 -0.3932  0.0083 ]
        [ -0.7184  2.5470 -0.2510 ]
        [  0.1000 -0.0034  0.1290 ]

    array Z = rand(3, 3);    // every bit is randomly 0 or 1

    Z = [ 3179217846 4161663997 2866113984 ]
        [ 3955638199 3973448494  472148682 ]
        [  167591721 1917059131 2019573489 ]

    array I = identity(3, 3);

    I = [ 1 0 0 ]
        [ 0 1 0 ]
        [ 0 0 1 ]

Basic Operations Array Manipulation

array T () const
• Transpose matrix or vector.
array H () const
• Conjugate transpose (1+2i becomes 1-2i).
array lower (const array &input, int diagonal=0)
• Extract lower triangular matrix (diagonal==0 is the center diagonal (default), diagonal<0 is below).
array upper (const array &input, int diagonal=0)
• Extract upper triangular matrix (diagonal==0 is the center diagonal (default), diagonal>0 is above).
array diag (const array &input)
• Extract or form diagonal matrix (if vector, produce diagonal matrix; if matrix, extract diagonal vector).
array join (int dim, const array &A ~ const array &D)
• Join multiple arrays along dimension dim.
array flip (const array &in, unsigned dim)
• Flip array along a given dimension (base-zero index).
array flat (const array &A)
• Flatten an array into column vector.
array shift (const array &in, int dim0=0 ~ int dim3=0) or array shift (const array &in, const array &shift)
• Shift the values of an array around dimension (wrap around).
array reorder (const array &in, int dim0=-1, int dim1=-1, int dim2=-1, int dim3=-1)
• Reorder dimensions of array.

Basic Operations Transpose

Example:

    array X = randn(2,3);

    X = [  0.5434 -1.4913  0.1374 ]
        [ -0.7168  1.4805 -1.2208 ]

    X.T() = [  0.5434 -0.7168 ]
            [ -1.4913  1.4805 ]
            [  0.1374 -1.2208 ]

    array Y = randn(2,3,c32);

    Y = [ 0.2925 - 0.7184i   2.5470 - 0.0034i   0.1290 + 0.3728i ]
        [ 0.1000 - 0.3932i   0.0083 - 0.2510i   1.0822 - 0.6650i ]

    Y.H() = [ 0.2925 + 0.7184i   0.1000 + 0.3932i ]
            [ 2.5470 + 0.0034i   0.0083 + 0.2510i ]
            [ 0.1290 - 0.3728i   1.0822 + 0.6650i ]

Basic Operations Component Extraction

Example:

    float p[] = { 8, 2, 3, 4, 9, 5, 6, 7, 1 };
    float v[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1 };
    array Z(3, 3, p, afHost);
    array W(3, 3, v, afHost);

    Z = [ 8 4 6 ]    W = [ 1 1 1 ]
        [ 2 9 7 ]        [ 1 1 1 ]
        [ 3 5 1 ]        [ 1 1 1 ]

    join(1, Z, W) = [ 8 4 6 1 1 1 ]
                    [ 2 9 7 1 1 1 ]
                    [ 3 5 1 1 1 1 ]

    lower(Z, 0) = [ 8 0 0 ]    upper(Z, 0) = [ 8 4 6 ]    diag(Z) = [ 8 ]
                  [ 2 9 0 ]                  [ 0 9 7 ]              [ 9 ]
                  [ 3 5 1 ]                  [ 0 0 1 ]              [ 1 ]

Basic Operations Array Manipulation

Example:

    float z[] = {1,2,3,4};
    array a(2, 2, z);
    array b = flip(a, 0);
    array c = flip(a, 1);
    array d(seq(1, 1, 4));          // d = [ 1 2 3 4 ]
    array e = tile(d, 2);
    array f = tile(d, dim4(1, 2));

    a = [ 1 3 ]    b = [ 2 4 ]    c = [ 3 1 ]
        [ 2 4 ]        [ 1 3 ]        [ 4 2 ]

    e = [ 1 2 3 4 ]    f = [ 1 1 ]
        [ 1 2 3 4 ]        [ 2 2 ]
                           [ 3 3 ]
                           [ 4 4 ]

Example:

    float z[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    array a(3, 3, z);
    array b(seq(5));
    array c(seq(3));
    c = tile(c, 1, 3);

    a = [ 1 4 7 ]    b = [ 0 1 2 3 4 ]    c = [ 0 0 0 ]
        [ 2 5 8 ]                             [ 1 1 1 ]
        [ 3 6 9 ]                             [ 2 2 2 ]

    shift(b, 2) = [ 3 4 0 1 2 ]

    reorder(c, 1, 0) = [ 0 1 2 ]
                       [ 0 1 2 ]
                       [ 0 1 2 ]

    flat(a) = [ 1 2 3 4 5 6 7 8 9 ]
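The wrap-around behaviour of shift can be illustrated on a plain vector. This is a sketch in standard C++ under the assumption that a positive shift moves elements toward higher indices, as in the shift(b, 2) result above; `shift_wrap` is a hypothetical stand-in, not ArrayFire's shift().

```cpp
#include <cstddef>
#include <vector>

// Circular shift: element i moves to position (i + k) mod n, so values
// pushed past the end wrap around to the front.
// shift_wrap is a hypothetical stand-in, not ArrayFire's shift().
std::vector<int> shift_wrap(const std::vector<int>& in, int k) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(in.size());
    for (int i = 0; i < n; ++i) {
        int dst = ((i + k) % n + n) % n;  // second modulo handles negative k
        out[static_cast<std::size_t>(dst)] = in[static_cast<std::size_t>(i)];
    }
    return out;
}
```

Applied to {0, 1, 2, 3, 4} with k = 2 this yields {3, 4, 0, 1, 2}, matching the slide.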

Basic Operations Arithmetic Functions

Provides a wide range of element-wise arithmetic functions for arrays
• Operator Overloading: negation, +, *, -, /, %, +=, -=, *=, /=, %=, ++, --, &, &&, |, ||, ^, ==, !=, <, <=, >, >=
• Geometry: sin, sinh, asin, asinh, cos, cosh, acos, acosh, tan, tanh, atan, atanh, atan2, hypot
• Check: isFinite, isInf, isNaN
• Math: sign, sqrt, root, pow2, pow, ceil, floor, round, trunc, min, max, log, log2, log10, log1p, exp, expm1, gamma, gammaln, epsilon, erf, erfc, erfinv, erfcinv, abs, arg, conjg, real, imag, complex, reman, mod

Basic Operations Device Management

Four functions cover device management:

• void info (bool isdebug=false) : Print diagnostic information on driver, runtime, memory, and devices.
• void deviceset (int index) : Switch to specified device.
• int deviceget () : Return the index of current device.
• int devicecount () : Return the number of available devices.

Basic Operations Parallelized Loops: gfor

• The gfor-loop construct may be used to simultaneously launch all of the iterations of a for-loop on the GPU, as long as the iterations are independent.
• While the standard for-loop performs each iteration sequentially, ArrayFire's gfor-loop performs each iteration at the same time (in parallel).
• ArrayFire does this by tiling out the values of all loop iterations and then performing computation on those tiles in one pass.
• You can think of gfor as performing auto-vectorization of your code, e.g. you write a gfor-loop that increments every element of a vector, but behind the scenes ArrayFire rewrites it to operate on the entire vector in parallel.

Basic Operations Parallelized Loops: gfor

Difference between for-loop and gfor-loop with example:

    for (int i = 0; i < N; ++i)
        A(span,span,i) = fft2(A(span,span,i));   // runs each FFT in sequence

    gfor (array i, N)
        A(span,span,i) = fft2(A(span,span,i));   // runs N FFTs in parallel

[Figure: C(,,1) = A(,,1) * B, C(,,2) = A(,,2) * B, C(,,3) = A(,,3) * B]

Basic Operations Parallelized Loops: gfor

There are three formats for instantiating gfor-loops:
1. gfor(var,n) Creates a sequence {0, 1, ..., n-1}
2. gfor(var,first,last) Creates a sequence {first, first+1, ..., last}
3. gfor(var,first,incr,last) Creates a sequence {first, first+incr, first+2*incr, ..., last}

Example: All of the following represent the same sequence: 0,1,2,3,4

    gfor (array i, 5)
    gfor (array i, 0, 4)
    gfor (array i, 0, 1, 4)

Basic Operations Parallelized Loops: gfor

• gfor is one of the most significant advantages of ArrayFire over other libraries
• Exploits our JIT engine to reduce the number of kernel invocations and passes over data
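The index sets produced by the three gfor forms listed earlier can be modelled on the host. The sketch below is plain C++ for illustration only; `gfor_indices` is a hypothetical helper, and it assumes a positive increment and an inclusive upper bound, as the sequences above indicate.

```cpp
#include <cassert>
#include <vector>

// Host-side model of the iteration indices for gfor(var, first, incr, last).
// gfor_indices is a hypothetical helper; it only lists the indices and
// does not reproduce how ArrayFire batches the loop body.
std::vector<int> gfor_indices(int first, int incr, int last) {
    assert(incr > 0);
    std::vector<int> idx;
    for (int i = first; i <= last; i += incr) idx.push_back(i);
    return idx;
}

// gfor(var, first, last): unit increment, inclusive bounds.
std::vector<int> gfor_indices(int first, int last) {
    return gfor_indices(first, 1, last);
}

// gfor(var, n): the sequence 0, 1, ..., n-1.
std::vector<int> gfor_indices(int n) {
    return gfor_indices(0, 1, n - 1);
}
```

`gfor_indices(5)`, `gfor_indices(0, 4)`, and `gfor_indices(0, 1, 4)` all return {0, 1, 2, 3, 4}.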

Graphics Package

• Once your data is in an array, it is trivial to visualize with the AF graphics package
• Surface plot
• 2D Plot
• 3D Plot
• Volume
• Image Rendering
• Multi-plot Figures

Graphics Package Surface

surface (const array &X) : Draw a surface plot with 2D data

Graphics Package 2D Plot

plot2 (const array &X) : Draw a line plot with 1D data
plot2 (const array &X, const char *linestyle)

Graphics Package 3D Plot

plot3 (const array &X, const af::array &Y, const array &Z) : Draw a scatter plot with 3D data

Graphics Package Volume

volume (const array &X) : Draw a volume with 3D data

Graphics Package Image

image (const array &X) : Draw a single scale image with 2D data

Graphics Package Figure

Display a figure window containing multiple plots

fig("color", table) : alter the color palette
fig("clear") : clear the figure
fig("draw") : redraw the figure (blocking)
fig("title", name) : label the figure with a title
fig("close") : close the figure window (blocking)
fig("sub", # of column, # of row, Nth) : divide the window into blocks

Graphics Package Color Palette

The fig() function fig("color", palette) accepts a set of built-in color maps.

[Figure: built-in color maps]

List of Built-in Functions

Data Analysis: sum, mul, min, max, minmax, alltrue, anytrue, where, count, segsum, accum, mean, var, cov, stdev, median, corrcoef, diff1, diff2, grad, setunique, setunion, setintersect, sort, sortdim, histogram, histequal

Linear Algebra: dot, matopts, matmul, solve, lu, qr, cholesky, hessenberg, eigen, svd, norm, inverse, matpow, rank, det

Image and Signal Processing: erode, dilate, rotate, resize, colorspace, fir, iir, regions, areas, centroids, moments, fft, ifft, fft2, ifft2, fft3, ifft3, filter, medfilt, bilateral, meanshift, convolve, approx1, approx2

Sparse Matrices: sparse, dense, where

List of Built-in Functions where

    array c = randu(5, 5);
    array d = where(c > 0.5);

    c = [ 0.7402 0.4464 0.7762 0.2920 0.2080 ]
        [ 0.9210 0.6673 0.2948 0.3194 0.6110 ]
        [ 0.0390 0.1099 0.7140 0.8109 0.3073 ]
        [ 0.9690 0.4702 0.3585 0.1541 0.4156 ]
        [ 0.9251 0.5132 0.6814 0.4452 0.2343 ]

    d = [  0.0000 ]    c(d) = [ 0.7402 ]
        [  1.0000 ]           [ 0.9210 ]
        [  3.0000 ]           [ 0.9690 ]
        [  4.0000 ]           [ 0.9251 ]
        [  6.0000 ]           [ 0.6673 ]
        [  9.0000 ]           [ 0.5132 ]
        [ 10.0000 ]           [ 0.7762 ]
        [ 12.0000 ]           [ 0.7140 ]
        [ 14.0000 ]           [ 0.6814 ]
        [ 17.0000 ]           [ 0.8109 ]
        [ 21.0000 ]           [ 0.6110 ]
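The indices in d are linear offsets counted down the columns of c. A host-side sketch of the same selection, in standard C++ over a flat column-major buffer, looks like this; `where_greater` is a hypothetical name and not ArrayFire's where().

```cpp
#include <cstddef>
#include <vector>

// Return the linear indices of every element greater than `threshold`.
// When `data` holds a matrix in column-major order, these are the same
// offsets that appear in d above.
// where_greater is a hypothetical helper, not an ArrayFire function.
std::vector<std::size_t> where_greater(const std::vector<float>& data,
                                       float threshold) {
    std::vector<std::size_t> idx;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] > threshold) idx.push_back(i);
    }
    return idx;
}
```

For the first column of c, {0.7402, 0.9210, 0.0390, 0.9690, 0.9251}, a threshold of 0.5 selects indices 0, 1, 3, and 4, which are the first four entries of d.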

Image Processing Examples

• Swap Channels
• Transpose
• Transforms
• Binary, Erosion, Dilate
• Histograms
• Filtering
• Image Smoothing

Image Processing Examples Swap Channels

RGB -> BGR

    array tmp = img(span,span,0);            // save the Red channel
    img(span,span,0) = img(span,span,2);     // Blue -> Red
    img(span,span,2) = tmp;                  // Red -> Blue

    // Joining components
    array swapped = join(2, img(span,span,2),    // blue
                            img(span,span,1),    // green
                            img(span,span,0));   // red
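On the host, the same swap is a range exchange when the image is held as three contiguous channel planes. The sketch below assumes a rows x cols x 3 column-major buffer with red in plane 0 and blue in plane 2; `swap_red_blue` is a hypothetical helper in standard C++, not an ArrayFire call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// RGB -> BGR on a planar buffer. With a rows x cols x 3 column-major
// layout (an assumption of this sketch), each channel is one contiguous
// plane, so swapping channels 0 and 2 is a swap of two ranges.
// swap_red_blue is a hypothetical helper, not part of ArrayFire.
void swap_red_blue(std::vector<float>& img, std::size_t rows, std::size_t cols) {
    const std::size_t plane = rows * cols;
    assert(img.size() == 3 * plane);
    const auto step = static_cast<std::ptrdiff_t>(plane);
    std::swap_ranges(img.begin(), img.begin() + step, img.begin() + 2 * step);
}
```

The green plane in the middle is left untouched, mirroring the join(2, blue, green, red) form above.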

Image Processing Examples Transpose

    array img = loadimage("image.jpg", false);
    array img_T = img.T();

Image Processing Examples Transform

    array half = resize(0.5, img);
    array rot90 = rotate(img, af::Pi/2);
    array warped = approx2(img, xLocations, yLocations);

Image Processing Examples Binary, Erosion, Dilate

Image Processing Examples Convolutions and Filtering

    array R = convolve(img, ker);           // 1,2,3d convolution
    array R = convolve(fcol, frow, img);    // Separable convolution
    array R = filter(img, ker);             // 2d correlation filter

Image Processing Examples Image Smoothing

    array S = bilateral(I, sigma_r, sigma_c);
    array M = meanshift(I, sigma_r, sigma_c, iter);
    array R = medfilt(img, 3, 3);

    // Gaussian blur
    array gker = gaussiankernel(ncols, ncols);
    array res = convolve(img, gker);

Questions?

• Download free version: http://accelereyes.com
• Email: [email protected]


Upcoming GTC Express Webinars

September 19 - Learn How to Debug OpenGL 4.2 with NVIDIA® Nsight™ Visual Studio Edition 3.1

September 24 - Pythonic Parallel Patterns for the GPU with NumbaPro

September 25 - An Introduction to GPU Programming

September 26 - Learn How to Profile OpenGL 4.2 with NVIDIA® Nsight™ Visual Studio Edition 3.1

October 29 - How to Improve Performance using the CUDA Memory Model and Features of the Kepler Architecture

Register at www.gputechconf.com/gtcexpress

GTC 2014 Call for Submissions

Looking for submissions in the fields of

• Science and research
• Professional graphics
• Mobile computing
• Automotive applications
• Game development
• Cloud computing

Submit by September 27 at www.gputechconf.com