ArrayFire Software Library for CUDA & OpenCL
Kyle Spafford, Senior Engineer

GeoInt Accelerator
• Platform to start developing with GPUs
• GPU access, GPU-accelerated apps and libraries
• Register to learn more: http://www.nvidia.com/geoint

Webinar Feedback
Submit your feedback for a chance to win Tesla K20 GPU*
https://www.surveymonkey.com/s/ArrayFire
* Offer is valid until October 1st 2013
Overview
1. Introduction to ArrayFire & Case Studies
   • Radar & Signal Processing
   • Image Processing
2. Getting Started With ArrayFire
   • Header Files
   • Data Types
   • N-Dimension Arrays
   • Data Structures
   • Basic Operations
   • Graphics Package
3. Quick Demo of Graphics Library

Introduction to ArrayFire
A fast software library for GPU computing with an easy-to-use API. Array-based function set that makes GPU programming simple. Available for C, C++, and Fortran and integrated with AMD, Intel, and NVIDIA hardware.
Introduction to ArrayFire Performance & Programmability
• Super easy to program
• Highly optimized

Introduction to ArrayFire Portability

Introduction to ArrayFire Scalability
Multi-GPU is one line of code

array *y = new array[n];
for (int i = 0; i < n; ++i) {
    deviceset(i);            // change GPUs
    array x = randu(5,5);    // add work to GPU's queue
    y[i] = fft(x);           // more work in queue
}
// all GPUs are now computing simultaneously

1. Introduction to ArrayFire Community
• Over 8,000 posts at http://forums.accelereyes.com
• Nightly library update releases
• Stable releases a few times a year
• v2.0 coming at the end of summer

1. Introduction to ArrayFire
http://accelereyes.com/case_studies
Speedup  Application       Organization
17X      Neuro-imaging     Georgia Tech
20X      Viral Analyses    CDC
20X      Video Processing  Google
45X      Radar Imaging     System Planning
12X      Medical Devices   Spencer Tech

1. Introduction to ArrayFire
http://accelereyes.com/case_studies

Speedup  Application     Organization
5X       Weather Models  NCAR
35X      Power Eng       IIT India
17X      Surveillance    BAE Systems
70X      Drug Delivery   Georgia Tech
35X      Bioinformatics  Leibnitz

Radar Image Formation
• Worked with System Planning Corp.
• Optimized a SAR/ISAR backprojection-based image formation algorithm
• 45x improvement over legacy code for large datasets

Radar Clutter Reduction
• Also with System Planning Corp.
• Optimized a clutter reduction algorithm based on multiple thresholding techniques
• 10x speedup over Core i7 CPU, 5x over DSP

UAV Imaging – Ground Vehicle Detection
• Project: Ground vehicle detection (IRAD)
• Performance baseline: 4.0 s (IPP-enhanced CPU code)
• Goal: 0.4 s (real-time)
• Problem: Optimized CPU code is 10X too slow. The workload consists of many small image patches, which are apparently not well suited for GPUs.

UAV Imaging – Ground Vehicle Detection
• Avoid low-level hassle with raw CUDA
• Completed as a combined services and ArrayFire project
• Result: 0.4 second goal achieved in < 1 month

Introduction to ArrayFire Hundreds of Functions
reductions
• sum, min, max, count, prod
• vectors, columns, rows, etc

signal processing
• convolutions, FFT, FIR, IIR

geometry
• sin, sinh, asin, asinh, cos, cosh, acos, acosh, tan, tanh, atan, atanh, atan2, hypot

graphics
• surface, arrow, plot, image, volume, figure

dense linear algebra
• LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power

math
• sign, sqrt, root, pow, ceil, floor, round, trunc, log, exp, gamma, epsilon, erf, abs, arg, real, etc.

image processing
• filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms

array manipulation
• transpose, conjugate, lower, upper, diag, join, flip, tile, flat, shift, reorder

device management
• info, deviceset, deviceget, devicecount

and many more…

Header Files
• #include <arrayfire.h>
* By including only “arrayfire.h” and using namespace af, you have all the ArrayFire functions ready to use
Monte Carlo Estimation of Pi
#include
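The listing for this slide did not survive beyond its first `#include`. As a stand-in, here is a minimal CPU-only sketch of the estimator the slide title names, written in standard C++ rather than ArrayFire. The function name `estimate_pi` and the fixed seed are assumptions made for illustration.

```cpp
#include <cstddef>
#include <random>

// Estimate pi by sampling points uniformly in the unit square and counting
// the fraction that lands inside the quarter circle x^2 + y^2 < 1.
// CPU-only illustration; estimate_pi and its default seed are assumptions.
double estimate_pi(std::size_t samples, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < samples; ++i) {
        const double x = unit(gen);
        const double y = unit(gen);
        if (x * x + y * y < 1.0) ++inside;
    }
    // Quarter-circle area divided by square area equals pi / 4.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}
```

In an array library the loop would typically collapse into two random arrays, an element-wise comparison, and a sum reduction, which is why this problem is a common first GPU example.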
Data Types
array  container object
b8     8-bit boolean byte
f32    real single precision
f64    real double precision
c32    complex single precision
c64    complex double precision
s32    32-bit signed integer
u32    32-bit unsigned integer

Generating Arrays
constant(0, 3)           // 3-by-1 column of zeros of single-precision (f32 default)
constant(1, 3, 2, f64)   // 3-by-2 matrix of ones of double-precision
rand(2, 4, u32)          // random 32-bit integers, every bit is random
randn(2, 2, s32)         // square matrix of random values of 32-bit signed integer
randu(5, 7, c32)         // complex values, all real and complex components
identity(3, 3, c64)      // 3-by-3 identity of complex double-precision
* Almost every function that takes a data type parameter defaults to f32

Data Structures Bringing in host data
Example:

float A[25];
for (int i = 0; i < 25; i++) {
    A[i] = i + 1;                     // A = { 1, 2, 3, ..., 24, 25 }
}
array mat = array(5, 5, A, afHost);   // host data fills the array column by column

mat = [ 1  6 11 16 21 ]
      [ 2  7 12 17 22 ]
      [ 3  8 13 18 23 ]
      [ 4  9 14 19 24 ]
      [ 5 10 15 20 25 ]
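The 3x3 examples later in these slides (Z built from p[], a built from z[]) fill matrices column by column. Assuming that column-major layout, the following standard C++ sketch shows how a (row, col) pair maps onto a flat host buffer. `at_column_major` and `iota_buffer` are hypothetical helpers, not ArrayFire functions.

```cpp
#include <cstddef>
#include <vector>

// Read element (row, col) from a flat buffer that stores a matrix with
// `rows` rows column by column (column-major order).
// Hypothetical helper for illustration only.
float at_column_major(const std::vector<float>& data, std::size_t rows,
                      std::size_t row, std::size_t col) {
    return data[col * rows + row];
}

// Fill a buffer with 1..n, mirroring the host array A in the example.
std::vector<float> iota_buffer(std::size_t n) {
    std::vector<float> a(n);
    for (std::size_t i = 0; i < n; ++i) a[i] = static_cast<float>(i + 1);
    return a;
}
```

With a 25-element buffer viewed as 5x5, the first five host values form the first column, so element (0, 1) is the sixth host value.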
N-Dimension Support
vectors
matrices
volumes

Related Data Structures
seq - Index generator (sequential or strided)
dim4 - Array dimensions descriptor
timer - Internal timer object
Exception - How AF handles errors

*These are all in namespace "af"

Data Structures Tons of ways to slice and dice arrays
Adjust the dimensions of an N-D array (fast)
array (const array &input, const dim4 &dims) : Adjust input array using dim4 type
array (const array &input, int dim0, int dim1=1, int dim2=1, int dim3=1)
array (const array &) : Duplicate an existing array (copy constructor).
array (const seq &s) : Convert a seq object for use with arithmetic.

Multiple Subscripts
array operator() (int x) const : Access linear element x.
array operator() (int row, int col) const : Access element at row and col.
array operator() (int x, int y, int z) const : Access element.
array operator() (const seq &x, int y, int z) const : Access vector.
array operator() (int x, const seq &y, const seq &z) const : Access matrix.
array operator() (const seq &w, const seq &x, const seq &y, const seq &z) const : Access volume.
Data Structures Tons of ways to slice and dice arrays
Access row/col to form a vector
array row (int i) const : Access row i to form row vector.
array col (int i) const : Access column i to form column vector.
array slice (int i) const : Access slice i to form a matrix (plane of a volume, channel in an image)

Operator overloading
+, *, -, /, %, +=, -=, *=, /=, %=, &&, ||, ^, ==, !=, <, <=, >, >=, =, <<

array as (dtype type) const : Cast array's contents to the given type.
array T () const : Transpose matrix or vector.
array H () const : Conjugate transpose (1+2i becomes 1-2i).
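To make the difference between T() and H() concrete, here is a small standard C++ sketch of a conjugate transpose over a column-major buffer. It is an illustration of the operation only; `conj_transpose` is a hypothetical name and says nothing about how the library implements H().

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Conjugate transpose of a rows-by-cols matrix stored column-major.
// The result is cols-by-rows, also column-major.
// Hypothetical helper for illustration only.
std::vector<cfloat> conj_transpose(const std::vector<cfloat>& in,
                                   std::size_t rows, std::size_t cols) {
    std::vector<cfloat> out(in.size());
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            // Input element (r, c) becomes output element (c, r), conjugated.
            out[r * cols + c] = std::conj(in[c * rows + r]);
    return out;
}
```

A plain transpose is the same loop without the `std::conj` call, so for real-valued data the two operations coincide.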
Data Structures seq
Sequential index generator
seq (double n) : Create sequence {0 1 ... n-1}
seq (double first, double inc, double last) : Create sequence {first first+inc ... last}.

Data Structures array & seq
Example:
array a(seq(4));           // a = [ 0 1 2 3 ]
array b(-seq(4));          // b = [ 3 2 1 0 ]
array c(seq(4)+1);         // c = [ 1 2 3 4 ]
array d(seq(4)*3);         // d = [ 0 3 6 9 ]
array e(seq(1, 0.5, 2));   // e = [ 1.0 1.5 2.0 ]
array f = e.as(u32);       // f = [ 1 1 2 ]
f = f - 2;                 // f = [ -1 -1 0 ]
array g = a.copy();        // g = [ 0 1 2 3 ]
array h = g.T();           // h = [ 0 ]
                           //     [ 1 ]
                           //     [ 2 ]
                           //     [ 3 ]

Data Structures
int end : Reference last element in dimension.
seq span : Reference entire dimension.

[Figure: regions of an array selected by A(1,1), A(1,span), A(end,1), A(end,span), and A(span,span,2)]

Data Structures dim4
Array dimensions descriptor
dim4 (unsigned x, unsigned y=1, unsigned z=1, unsigned w=1) : If only x is specified, constructs column vector.
dims () const : Access underlying array.
elements () const : Number of elements (product of dimensions).
rest (unsigned i) const : Product of dimensions i on.
numdims () const : Number of dimensions set.
Operator overloading: [], ==, !=, =

Data Structures Dim4 special accessors
Example:
array a = constant(0,4,5,6);
a.dims().elements();   // Returns 120 (4x5x6)
a.dims().rest(1);      // Returns 30 (5x6)
a.dims().numdims();    // Returns 3

Data Structures timer
A platform-independent timer with microsecond accuracy:
start() : Starts a timer.
stop() : Returns elapsed seconds since last start().
stop (timer start) : Returns elapsed seconds since start.

Accurate and reliable measurement of performance involves several factors:
• Executing enough iterations to achieve peak performance.
• Executing enough repetitions to amortize any overhead from system timers.

Data Structures timer vs. timeit
double af::timeit (void (*fn)()) : Robust timing of a function (either CPU or GPU).

Basic timing is sometimes inaccurate because the GPU takes some time to initialize and load libraries on the first run. timeit() solves this problem by invoking the function multiple times to produce a robust timing result. It is a useful method for benchmarking GPU or CPU functions.

An example follows.

Data Structures timeit
static void test_function() {
    // What you want to time…
}

int main(int argc, char **argv) {
    double elapsed_time = timeit(test_function);
    printf("test_function took %.6g seconds\n", elapsed_time);
}
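The idea behind robust timing (discard the first run, then measure several repetitions) can be sketched in standard C++ as below. This is not ArrayFire's timeit implementation; `robust_time`, the repetition count, and the choice of the median are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Time fn robustly: run it once untimed to absorb one-time start-up cost,
// then time several repetitions and report the median in seconds.
// Hypothetical stand-in; not how any particular library does it.
double robust_time(void (*fn)(), int repetitions = 5) {
    using clock = std::chrono::steady_clock;
    if (repetitions < 1) repetitions = 1;
    fn();  // warm-up run, not measured
    std::vector<double> seconds;
    for (int i = 0; i < repetitions; ++i) {
        const auto start = clock::now();
        fn();
        const std::chrono::duration<double> elapsed = clock::now() - start;
        seconds.push_back(elapsed.count());
    }
    std::sort(seconds.begin(), seconds.end());
    return seconds[seconds.size() / 2];
}
```

When the timed function queues GPU work asynchronously, a timer like this would also need to wait for the device to finish before reading the clock; that step is omitted here because the sketch is CPU-only.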
Data Structures C++ Exceptions
Exception for ArrayFire
exception ()
what () const throw () : returns the exception as a string (char *)
&operator<< (std::ostream &s, const exception &e) : prints the error message

Example:
try {
    ...
} catch (af::exception& e) {
    fprintf(stderr, "%s\n", e.what());
    throw;
}

Basic Operations
• Generate and Fill Arrays
• Array Manipulation
• Arithmetic Functions
• Device Management
• Parallelized Loops: gfor
Basic Operations Generate and fill arrays
array constant (float value, unsigned d0 ~ d3, dtype ty=f32)
    Create array d0 x d1 x d2 x d3 filled with value
array identity (unsigned d0 ~ d3, dtype ty=f32)
    Create identity matrix array d0 x d1 x d2 x d3
array randu (unsigned d0 ~ d3, dtype ty=f32)
    Create array d0 x d1 x d2 x d3 with uniformly distributed random numbers between 0 and 1
array randn (unsigned d0 ~ d3, dtype ty=f32)
    Create array d0 x d1 x d2 x d3 with normally distributed random numbers (mean 0, standard deviation 1)
array rand (unsigned d0 ~ d3, dtype ty=u32)
    Create array d0 x d1 x d2 x d3 of random integers in which every bit is randomly 0 or 1
Basic Operations Generate and fill arrays
Example:
Create 3x3 matrix of zeros:
array a = constant(0, 3, 3);

a = [ 0 0 0 ]
    [ 0 0 0 ]
    [ 0 0 0 ]

Create 3x3 matrix of random numbers between 0 and 1:
array b = randu(3, 3);

b = [ 0.2920 0.1541 0.6110 ]
    [ 0.3194 0.4452 0.3073 ]
    [ 0.8109 0.2080 0.4156 ]
Basic Operations Generate and fill arrays
Example:
array Y = randn(3, 3);   // Y ~ N(0,1)

Y = [  0.2925 -0.3932  0.0083 ]
    [ -0.7184  2.5470 -0.2510 ]
    [  0.1000 -0.0034  0.1290 ]

array Z = rand(3, 3);    // every bit is randomly 0 or 1

Z = [ 3179217846 4161663997 2866113984 ]
    [ 3955638199 3973448494  472148682 ]
    [  167591721 1917059131 2019573489 ]

array I = identity(3, 3);

I = [ 1 0 0 ]
    [ 0 1 0 ]
    [ 0 0 1 ]
Basic Operations Array Manipulation
array T () const
    Transpose matrix or vector.
array H () const
    Conjugate transpose (1+2i becomes 1-2i).
array lower (const array &input, int diagonal=0)
    Extract lower triangular matrix (diagonal==0 is the center diagonal (default), diagonal<0 is below).
array upper (const array &input, int diagonal=0)
    Extract upper triangular matrix (diagonal==0 is the center diagonal (default), diagonal>0 is above).
array diag (const array &input)
    Extract or form diagonal matrix (if vector, produce diagonal matrix; if matrix, extract diagonal vector).
array join (int dim, const array &A ~ const array &D)
    Join multiple arrays along dimension dim.

Basic Operations Array Manipulation
array flip (const array &in, unsigned dim)
    Flip array along a given dimension (base-zero index).
array flat (const array &A)
    Flatten an array into column vector.
array shift (const array &in, int dim0=0 ~ int dim3=0) or array shift (const array &in, const array &shift)
    Shift the values of an array around dimension (wrap around).
array reorder (const array &in, int dim0=-1, int dim1=-1, int dim2=-1, int dim3=-1)
    Reorder dimensions of array.
Basic Operations Transpose
Example:
array X = randn(2,3);
array Y = randn(2,3,c32);

X = [  0.5434 -1.4913  0.1374 ]
    [ -0.7168  1.4805 -1.2208 ]

X.T() = [  0.5434 -0.7168 ]
        [ -1.4913  1.4805 ]
        [  0.1374 -1.2208 ]

Y = [ 0.2925 - 0.7184i  2.5470 - 0.0034i  0.1290 + 0.3728i ]
    [ 0.1000 - 0.3932i  0.0083 - 0.2510i  1.0822 - 0.6650i ]

Y.H() = [ 0.2925 + 0.7184i  0.1000 + 0.3932i ]
        [ 2.5470 + 0.0034i  0.0083 + 0.2510i ]
        [ 0.1290 - 0.3728i  1.0822 + 0.6650i ]

Basic Operations Component Extraction
Example:
float p[] = { 8, 2, 3, 4, 9, 5, 6, 7, 1 };
float v[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1 };
array Z(3, 3, p, afHost);
array W(3, 3, v, afHost);

Z = [ 8 4 6 ]    W = [ 1 1 1 ]
    [ 2 9 7 ]        [ 1 1 1 ]
    [ 3 5 1 ]        [ 1 1 1 ]

join(1, Z, W) = [ 8 4 6 1 1 1 ]
                [ 2 9 7 1 1 1 ]
                [ 3 5 1 1 1 1 ]

lower(Z, 0) = [ 8 0 0 ]    upper(Z, 0) = [ 8 4 6 ]    diag(Z) = [ 8 ]
              [ 2 9 0 ]                  [ 0 9 7 ]              [ 9 ]
              [ 3 5 1 ]                  [ 0 0 1 ]              [ 1 ]
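The triangular and diagonal extractions in this example can be checked with a short standard C++ sketch over the same column-major data. `lower_triangle` and `main_diagonal` are hypothetical helpers that cover only the diagonal==0 case for square matrices.

```cpp
#include <cstddef>
#include <vector>

// Keep the lower triangle (including the main diagonal) of an n-by-n
// column-major matrix and zero everything above it.
// Hypothetical helper for illustration only.
std::vector<float> lower_triangle(const std::vector<float>& m, std::size_t n) {
    std::vector<float> out(m.size(), 0.0f);
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = c; r < n; ++r)   // rows at or below the diagonal
            out[c * n + r] = m[c * n + r];
    return out;
}

// Extract the main diagonal of an n-by-n column-major matrix as a vector.
std::vector<float> main_diagonal(const std::vector<float>& m, std::size_t n) {
    std::vector<float> d(n);
    for (std::size_t i = 0; i < n; ++i) d[i] = m[i * n + i];
    return d;
}
```

Feeding in p[] from the example reproduces the lower(Z, 0) and diag(Z) results shown on the slide.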
Basic Operations Array Manipulation
Example:
float z[] = {1,2,3,4};
array a(2, 2, z);
array b = flip(a, 0);
array c = flip(a, 1);
array d(seq(5));                 // d=[1 2 3 4]
array e = tile(d, 2);
array f = tile(d, dim4(1, 2));

a = [ 1 3 ]    b = [ 2 4 ]    c = [ 3 1 ]
    [ 2 4 ]        [ 1 3 ]        [ 4 2 ]

e = [ 1 2 3 4 ]    f = [ 1 1 ]
    [ 1 2 3 4 ]        [ 2 2 ]
                       [ 3 3 ]
                       [ 4 4 ]
Basic Operations Array Manipulation
Example:
float z[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
array a(3, 3, z);
array b(seq(5));
array c(seq(3));
c = tile(c, 1, 3);

a = [ 1 4 7 ]    b = [ 0 1 2 3 4 ]    c = [ 0 0 0 ]
    [ 2 5 8 ]                             [ 1 1 1 ]
    [ 3 6 9 ]                             [ 2 2 2 ]

shift(b, 2) = [ 3 4 0 1 2 ]

reorder(c, 1, 0) = [ 0 1 2 ]
                   [ 0 1 2 ]
                   [ 0 1 2 ]

flat(a) = [ 1 2 3 4 5 6 7 8 9 ]
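The wrap-around behaviour of shift(b, 2) in this example is a circular rotation. A minimal standard C++ sketch for the one-dimensional case follows; `circular_shift` is a hypothetical helper and handles only non-negative offsets.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Circularly shift a vector to the right by `offset` positions, wrapping
// elements that fall off the end back to the front.
// Hypothetical helper for illustration only.
std::vector<int> circular_shift(std::vector<int> v, std::size_t offset) {
    if (v.empty()) return v;
    offset %= v.size();
    // std::rotate makes the element at `middle` the new first element, so
    // rotating about (end - offset) produces a right shift.
    std::rotate(v.begin(), v.end() - static_cast<std::ptrdiff_t>(offset), v.end());
    return v;
}
```

Applied to {0, 1, 2, 3, 4} with an offset of 2, this yields the same ordering as the slide's shift(b, 2) result.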
Basic Operations Arithmetic Functions
Provides all kinds of element-wise arithmetic functions for arrays.
• Operator overloading: negation, +, *, -, /, %, +=, -=, *=, /=, %=, ++, --, &, &&, |, ||, ^, ==, !=, <, <=, >, >=
• Geometry: sin, sinh, asin, asinh, cos, cosh, acos, acosh, tan, tanh, atan, atanh, atan2, hypot
• Check: isFinite, isInf, isNaN
• Math: sign, sqrt, root, pow2, pow, ceil, floor, round, trunc, min, max, log, log2, log10, log1p, exp, expm1, gamma, gammaln, epsilon, erf, erfc, erfinv, erfcinv, abs, arg, conjg, real, imag, complex, reman, mod

Basic Operations Device Management
Four functions are all you need.

void info (bool isdebug=false)
    Print diagnostic information on driver, runtime, memory, and devices.
void deviceset (int index)
    Switch to specified device.
int deviceget ()
    Return the index of current device.
int devicecount ()
    Return the number of available devices.

Basic Operations Parallelized Loops: gfor
The gfor-loop construct may be used to simultaneously launch all of the iterations of a for-loop on the GPU, as long as the iterations are independent. While the standard for-loop performs each iteration sequentially, ArrayFire's gfor-loop performs each iteration at the same time (in parallel). ArrayFire does this by tiling out the values of all loop iterations and then performing computation on those tiles in one pass. You can think of gfor as performing auto-vectorization of your code: for example, you write a gfor-loop that increments every element of a vector, but behind the scenes ArrayFire rewrites it to operate on the entire vector in parallel.
Basic Operations Parallelized Loops: gfor
Difference between for-loop and gfor-loop with example:
for (int i = 0; i < N; ++i)
    A(span,span,i) = fft2(A(span,span,i));   // runs each FFT in sequence

gfor (array i, N)
    A(span,span,i) = fft2(A(span,span,i));   // runs N FFTs in parallel

[Figure: three independent products, C(,,1) = A(,,1) * B, C(,,2) = A(,,2) * B, C(,,3) = A(,,3) * B]

Basic Operations Parallelized Loops: gfor
There are three formats for instantiating gfor-loops:
1. gfor(var,n) Creates a sequence {0, 1, ..., n-1}
2. gfor(var,first,last) Creates a sequence {first, first+1, ..., last}
3. gfor(var,first,inc,last) Creates a sequence {first, first+inc, first+2*inc, ..., last}

Example: All of the following represent the same sequence: 0,1,2,3,4
gfor (array i, 5)
gfor (array i, 0, 4)
gfor (array i, 0, 1, 4)

Basic Operations Parallelized Loops: gfor
• gfor is one of the most significant advantages of ArrayFire over other libraries
• Exploits our JIT engine to reduce the number of kernel invocations and passes over data
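The three gfor range forms listed earlier all expand to a plain index sequence with an inclusive upper bound. The standard C++ sketch below spells that expansion out; `make_range` is a hypothetical helper, it assumes a positive increment, and it is unrelated to how gfor itself is implemented.

```cpp
#include <vector>

// Expand {first, first+inc, ..., last} with `last` inclusive, matching the
// four-argument gfor form. Hypothetical helper; assumes inc > 0.
std::vector<int> make_range(int first, int inc, int last) {
    std::vector<int> out;
    if (inc <= 0) return out;  // guard against a non-terminating loop
    for (int v = first; v <= last; v += inc) out.push_back(v);
    return out;
}

// gfor(var, n) style: {0, 1, ..., n-1}
std::vector<int> make_range(int n) { return make_range(0, 1, n - 1); }

// gfor(var, first, last) style: {first, first+1, ..., last}
std::vector<int> make_range(int first, int last) { return make_range(first, 1, last); }
```

Calling make_range(5), make_range(0, 4), and make_range(0, 1, 4) gives the same five indices, mirroring the equivalence stated in the gfor example.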
Graphics Package
Once your data is an array, it is trivial to visualize with the AF graphics package
Surface plot 2D Plot 3D Plot Volume Image Rendering Multi-plot Figures

Graphics Package Surface
surface (const array &X) : Draw a surface plot with 2D data Example:
Graphics Package 2D Plot
plot2 (const array &X) : Draws a line plot with 1D data
plot2 (const array &X, const char *linestyle) Example:
Graphics Package 3D Plot
plot3 (const array &X, const array &Y, const array &Z) : Draw a scatter plot with 3D data Example:
Graphics Package Volume
volume (const array &X) : Draw a volume with 3D data Example:
Graphics Package Image
image (const array &X) : Draw a single scale image with 2D data Example:
Graphics Package Figure
Display a figure window Containing multiple plots
fig("color", table) : alter the color palette
fig("clear") : clear the figure
fig("draw") : redraw the figure (blocking)
fig("title", name) : label the figure with a title
fig("close") : close the figure window (blocking)
fig("sub", # of column, # of row, Nth) : divide the window into blocks
Graphics Package Color Palette
The fig() function fig("color", palette) accepts the following color maps :
List of Built-in Functions
Data Analysis: sum, mul, min, max, minmax, alltrue, anytrue, where, count, segsum, accum, mean, var, cov, stdev, median, corrcoef, diff1, diff2, grad, setunique, setunion, setintersect, sort, sortdim, histogram, histequal

Linear Algebra: dot, matopts, matmul, solve, lu, qr, cholesky, hessenberg, eigen, svd, norm, inverse, matpow, rank, det

Image and Signal Processing: erode, dilate, rotate, resize, colorspace, fir, iir, regions, areas, centroids, moments, fft, ifft, fft2, ifft2, fft3, ifft3, filter, medfilt, bilateral, meanshift, convolve, approx1, approx2

Sparse Matrices: sparse, dense, where
List of Built-in Functions where
array c = randu(5, 5);
array d = where(c > 0.5);

c = [ 0.7402 0.4464 0.7762 0.2920 0.2080 ]
    [ 0.9210 0.6673 0.2948 0.3194 0.6110 ]
    [ 0.0390 0.1099 0.7140 0.8109 0.3073 ]
    [ 0.9690 0.4702 0.3585 0.1541 0.4156 ]
    [ 0.9251 0.5132 0.6814 0.4452 0.2343 ]

d = [  0.0000 ]    c(d) = [ 0.7402 ]
    [  1.0000 ]           [ 0.9210 ]
    [  3.0000 ]           [ 0.9690 ]
    [  4.0000 ]           [ 0.9251 ]
    [  6.0000 ]           [ 0.6673 ]
    [  9.0000 ]           [ 0.5132 ]
    [ 10.0000 ]           [ 0.7762 ]
    [ 12.0000 ]           [ 0.7140 ]
    [ 14.0000 ]           [ 0.6814 ]
    [ 17.0000 ]           [ 0.8109 ]
    [ 21.0000 ]           [ 0.6110 ]
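The indices in d are linear positions counted down the columns of c. A short standard C++ sketch of that selection follows, assuming the matrix is handed over as a flat column-major buffer; `indices_above` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Return the linear indices of all elements greater than `threshold`.
// If `data` holds a matrix in column-major order, the indices count down
// each column in turn. Hypothetical helper for illustration only.
std::vector<std::size_t> indices_above(const std::vector<float>& data,
                                       float threshold) {
    std::vector<std::size_t> idx;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (data[i] > threshold) idx.push_back(i);
    return idx;
}
```

Running this over the 25 values of c, listed column by column, with a threshold of 0.5 returns the same eleven indices shown in d.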
Image Processing Examples
Swap Channels Transpose Transforms Binary, Erosion, Dilate Histograms Filtering Image Smoothing
Image Processing Examples Swap Channels
RGB -> BGR
array tmp = img(span,span,0);          // save the Red channel
img(span,span,0) = img(span,span,2);   // Blue -> Red
img(span,span,2) = tmp;                // Red -> Blue

// Joining components
array swapped = join(2,
                     img(span,span,2),   // blue
                     img(span,span,1),   // green
                     img(span,span,0));  // red
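For comparison, the same channel swap on a raw host buffer is a single range swap when the image is stored as three consecutive planes. The sketch below assumes that planar layout (channel as the slowest dimension); `swap_red_blue` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Swap the first and third channel planes of an image stored as three
// consecutive planes of `plane_size` elements each.
// Hypothetical helper for illustration only.
void swap_red_blue(std::vector<float>& img, std::size_t plane_size) {
    if (img.size() < 3 * plane_size) return;  // not a full 3-channel image
    const auto plane = static_cast<std::ptrdiff_t>(plane_size);
    // Exchange plane 0 (red) with plane 2 (blue); plane 1 (green) stays put.
    std::swap_ranges(img.begin(), img.begin() + plane, img.begin() + 2 * plane);
}
```

Unlike the temporary-array version on the slide, the range swap needs no extra copy of a whole channel.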
Image Processing Examples Transpose
array img = loadimage("image.jpg", false);
array img_T = img.T();

Image Processing Examples Transform
array half = resize(0.5, img);
array rot90 = rotate(img, af::Pi/2);
array warped = approx2(img, xLocations, yLocations);

Image Processing Examples Binary, Erosion, Dilate

Image Processing Examples Convolutions and Filtering
array R = convolve(img, ker);          // 1,2,3d convolution
array R = convolve(fcol, frow, img);   // Separable convolution
array R = filter(img, ker);            // 2d correlation filter

Image Processing Examples Image Smoothing
array S = bilateral(I, sigma_r, sigma_c);
array M = meanshift(I, sigma_r, sigma_c, iter);
array R = medfilt(img, 3, 3);

// Gaussian blur
array gker = gaussiankernel(ncols, ncols);
array res = convolve(img, gker);
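The kernel used for the Gaussian blur above is a grid of weights that fall off with distance from the center and sum to 1. The standard C++ sketch below builds such a kernel; `gaussian_kernel_2d` and its explicit `sigma` parameter are assumptions, since the slide does not show how gaussiankernel chooses its spread.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Build an n-by-n Gaussian kernel normalized so that its entries sum to 1.
// The kernel is symmetric, so row-major and column-major storage coincide.
// Hypothetical helper; sigma is an assumed, caller-supplied spread.
std::vector<double> gaussian_kernel_2d(std::size_t n, double sigma) {
    std::vector<double> k(n * n);
    const double center = (static_cast<double>(n) - 1.0) / 2.0;
    double sum = 0.0;
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t c = 0; c < n; ++c) {
            const double dy = static_cast<double>(r) - center;
            const double dx = static_cast<double>(c) - center;
            k[r * n + c] = std::exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
            sum += k[r * n + c];
        }
    }
    for (double& v : k) v /= sum;  // normalize so blurring preserves brightness
    return k;
}
```

Convolving an image with these weights replaces each pixel by a distance-weighted average of its neighbourhood, which is what produces the smoothing.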
Questions?
Download free version: http://accelereyes.com Email: [email protected]
Upcoming GTC Express Webinars
September 19 - Learn How to Debug OpenGL 4.2 with NVIDIA® Nsight™ Visual Studio Edition 3.1
September 24 - Pythonic Parallel Patterns for the GPU with NumbaPro
September 25 - An Introduction to GPU Programming
September 26 - Learn How to Profile OpenGL 4.2 with NVIDIA® Nsight™ Visual Studio Edition 3.1
October 29 - How to Improve Performance using the CUDA Memory Model and Features of the Kepler Architecture
Register at www.gputechconf.com/gtcexpress

GTC 2014 Call for Submissions
Looking for submissions in the fields of
• Science and research
• Professional graphics
• Mobile computing
• Automotive applications
• Game development
• Cloud computing
Submit by September 27 at www.gputechconf.com