ArrayFire Software Library for CUDA & OpenCL

Kyle Spafford, Senior Engineer

GeoInt Accelerator
• Platform to start developing with GPUs
• GPU access, GPU-accelerated apps and libraries
• Register to learn more: http://www.nvidia.com/geoint

Webinar Feedback
Submit your feedback for a chance to win a Tesla K20 GPU*
https://www.surveymonkey.com/s/ArrayFire
* Offer is valid until October 1st 2013

Overview
1. Introduction to ArrayFire & Case Studies
   • Radar & Signal Processing
   • Image Processing
2. Getting Started With ArrayFire
   • Header Files
   • Data Types
   • N-Dimension Arrays
   • Data Structures
   • Basic Operations
   • Graphics Package
3. Quick Demo of Graphics Library

Introduction to ArrayFire
• A fast software library for GPU computing with an easy-to-use API.
• Array-based function set that makes GPU programming simple.
• Available for C, C++, and Fortran and integrated with AMD, Intel, and NVIDIA hardware.

Introduction to ArrayFire: Performance & Programmability
• Super easy to program
• Highly optimized

Introduction to ArrayFire: Portability

Introduction to ArrayFire: Scalability
Multi-GPU is 1 line of code

    array *y = new array[n];
    for (int i = 0; i < n; ++i) {
        deviceset(i);           // change GPUs
        array x = randu(5,5);   // add work to GPU's queue
        y[i] = fft(x);          // more work in queue
    }
    // all GPUs are now computing simultaneously

Introduction to ArrayFire: Community
• Over 8,000 posts at http://forums.accelereyes.com
• Nightly library update releases
• Stable releases a few times a year
• v2.0 coming at the end of summer

Introduction to ArrayFire: Case Studies
http://accelereyes.com/case_studies

    Speedup   Application        Organization
    17X       Neuro-imaging      Georgia Tech
    20X       Viral Analyses     CDC
    20X       Video Processing   Google
    45X       Radar Imaging      System Planning
    12X       Medical Devices    Spencer Tech

Introduction to ArrayFire: Case Studies (continued)
http://accelereyes.com/case_studies

    Speedup   Application        Organization
    5X        Weather Models     NCAR
    35X       Power Eng          IIT India
    17X       Surveillance       BAE Systems
    70X       Drug Delivery      Georgia Tech
    35X       Bioinformatics     Leibnitz

Radar Image Formation
• Worked with System Planning Corp.
• Optimized a SAR/ISAR backprojection-based image formation algorithm
• 45x improvement over legacy code for large datasets

Radar Clutter Reduction
• Also with System Planning Corp.
• Optimized a clutter reduction algorithm based on multiple thresholding techniques
• 10x speedup over Core i7 CPU, 5x over DSP

UAV Imaging – Ground Vehicle Detection
• Project: Ground vehicle detection (IRAD)
• Performance baseline: 4.0 s (IPP-enhanced CPU code)
• Goal: 0.4 s (real-time)
• Problem: Optimized CPU code is 10X too slow. Too many small image patches, apparently not well suited to GPUs.

UAV Imaging – Ground Vehicle Detection
• Avoid low-level hassle with raw CUDA
• Completed as a combined services and ArrayFire project
• Result: 0.4 second goal achieved in < 1 month

Introduction to ArrayFire: Hundreds of Functions
reductions
• sum, min, max, count, prod
• vectors, columns, rows, etc.
signal processing
• convolutions, FFT, FIR, IIR
geometry
• sin, sinh, asin, asinh, cos, cosh, acos, acosh, tan, tanh, atan, atanh, atan2, hypot
graphics
• surface, arrow, plot, image, volume, figure
dense linear algebra
• LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power
math
• sign, sqrt, root, pow, ceil, floor, round, trunc, log, exp, gamma, epsilon, erf, abs, arg, real, etc.
image processing
• filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms
array manipulation
• transpose, conjugate, lower, upper, diag, join, flip, tile, flat, shift, reorder
device management
• info, deviceset, deviceget, devicecount
and many more…

Header Files
• #include <arrayfire.h>
• header file for all ArrayFire functions
• All AF functions are organized into the af namespace
* By including only "arrayfire.h" and using namespace af, you have all the ArrayFire functions ready to use

Monte Carlo Estimation of Pi

    #include <stdio.h>
    #include <arrayfire.h>
    using namespace af;

    int main() {
        // 20 million random samples
        int n = 20e6;
        array x = randu(n,1), y = randu(n,1);
        // how many fell inside unit circle?
        float pi = 4 * sum<float>(sqrt(mul(x,x)+mul(y,y))<1) / n;
        printf("pi = %g\n", pi);
        return 0;
    }

Data Types

    array   container object
    b8      8-bit boolean byte
    f32     real single precision
    f64     real double precision
    c32     complex single precision
    c64     complex double precision
    s32     32-bit signed integer
    u32     32-bit unsigned integer

Generating Arrays

    constant(0, 3)          // 3-by-1 column of zeros of single-precision (f32 default)
    constant(1, 3, 2, f64)  // 3-by-2 matrix of ones of double-precision
    rand(2, 4, u32)         // random 32-bit integers, every bit is random
    randn(2, 2, s32)        // square matrix of random values of 32-bit signed integer
    randu(5, 7, c32)        // complex values, all real and complex components
    identity(3, 3, c64)     // 3-by-3 identity of complex double-precision

* Almost every function that takes a "data type" parameter defaults to f32

Data Structures: Bringing in host data
Example:

    float A[]; A = { 1, 2, 3, 4, 5, 6, 7, 8, ...
                   , 21, 22, 23, 24, 25 }

    for (int i = 0; i < 25; i++) { A[i] = i + 1; }
    array mat = array(5, 5, A, afHost);

    mat = [  1  2  3  4  5 ]
          [  6  7  8  9 10 ]
          [ 11 12 13 14 15 ]
          [ 16 17 18 19 20 ]
          [ 21 22 23 24 25 ]

N-Dimension Support
• vectors
• matrices
• volumes

Related Data Structures
• seq - Index generator (sequential or strided)
• dim4 - Array dimensions descriptor
• timer - Internal timer object
• Exception - How AF handles errors
* These are all in namespace "af"

Data Structures: array
Tons of ways to slice and dice arrays

Adjust the dimensions of an N-D array (fast)
• array (const array &input, const dim4 &dims) : Adjust input array using dim4 type
• array (const array &input, int dim0, int dim1=1, int dim2=1, int dim3=1)
• array (const array &) : Duplicate an existing array (copy constructor).
• array (const seq &s) : Convert a seq object for use with arithmetic.

Multiple Subscripts
• array operator() (int x) const : Access linear element x.
• array operator() (int row, int col) const : Access linear element at row and col.
• array operator() (int x, int y, int z) const : Access element.
• array operator() (const seq &x, int y, int z) const : Access vector.
• array operator() (int x, const seq &y, const seq &z) const : Access matrix.
• array operator() (const seq &w, const seq &x, const seq &y, const seq &z) const : Access volume.

Data Structures: array (continued)
Tons of ways to slice and dice arrays

Access row/col to form a vector
• array row (int i) const : Access row i to form row vector.
• array col (int i) const : Access column i to form column vector.
• array slice (int i) const : Access slice i to form a matrix (plane of a volume, channel in an image)

Operator overloading
• +, *, -, /, %, +=, -=, *=, /=, %=, &&, ||, ^, ==, !=, <, <=, >, >=, =, <<
• array as (dtype type) const : Cast array's contents to the given type.
• array T () const : Transpose matrix or vector.
• array H () const : Conjugate transpose (1+2i becomes 1-2i).

Data Structures: seq
Sequential index generator
• seq (double n) : Create sequence {0 1 ... n-1}
• seq (double first, double inc, double last) : Create sequence {first first+inc ... last}.

Data Structures: array & seq
Example:

    array a(seq(4));          // a = [ 0 1 2 3 ]
    array b(-seq(4));         // b = [ 3 2 1 0 ]
    array c(seq(4)+1);        // c = [ 1 2 3 4 ]
    array d(seq(4)*3);        // d = [ 0 3 6 9 ]
    array e(seq(1, 0.5, 2));  // e = [ 1.0 1.5 2.0 ]
    array f = e.as(u32);      // f = [ 1 1 2 ]
    f = f - 2;                // f = [ -1 -1 0 ]
    array g = a.copy();       // g = [ 0 1 2 3 ]
    array h = g.T();          // h = [ 0 ]
                              //     [ 1 ]
                              //     [ 2 ]
                              //     [ 3 ]

Data Structures: end and span
• int end : Reference last element in dimension.
• seq span : Reference entire dimension.
Examples: A(span,span,2)   A(1,1)   A(1,span)   A(end,1)   A(end,span)

Data Structures: dim4
Array dimensions descriptor
• dim4 (unsigned x, unsigned y=1, unsigned z=1, unsigned w=1) : If only x is specified, constructs column vector.
• dims () const : Access underlying array.
• elements () const : Number of elements (product of dimensions)
• rest (unsigned i) const : Product of dimensions i on.
• numdims () const : Number of dimensions set.

Operator overloading
• [], ==, !=, =

Data Structures: dim4 special accessors
Example:

    array a = constant(0,4,5,6);
    a.dims().elements();  // Returns 120 (4x5x6)
    a.dims().rest(1);     // Returns 30 (5x6)
    a.dims().numdims();   // Returns 3

Data Structures: timer
A platform-independent timer with microsecond accuracy:
• start() : Starts a timer.
• stop() : Returns elapsed seconds since last start().
• stop (timer start) : Returns elapsed seconds since start.

Accurate and reliable measurement of performance involves several factors:
• Executing enough iterations to achieve peak performance.
• Executing enough repetitions to amortize any overhead from system timers.

Data Structures: timer vs. timeit
double af::timeit (void(*)() fn) : Robust timing of a function (both CPU or GPU).
Basic timing is sometimes inaccurate because the GPU takes some time for initialization and loading libraries on the first run.
timeit() solves this problem by performing multiple function invocations for a robust timing result.
Useful method for benchmarking GPU or CPU functions. Take a look at an example in the next slide.

Data Structures: timeit

    static void test_function() {
        // What you want to time…
    }

    int main(int argc, char **argv) {
        double elapsed_time = timeit(test_function);
        printf("test_function took %.6g seconds\n", elapsed_time);
    }

Data Structures: C++ Exceptions
Exception for ArrayFire
• exception ()
• what () const throw () : returns the exception as string (char *)
• &operator<< (std::ostream &s, const exception &e) : prints the error message

Example:

    try {
        ...
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }

Basic Operations
