ArrayFire: Open Source

An Introduction to the ArrayFire Library

● We make code run faster
○ Started in 2007 by Georgia Tech researchers

ArrayFire: The People

● Diverse research background
○ HPC
○ Computer vision
○ Image processing
○ Social network analysis
○ Optical interferometry

● From all over: Georgia Tech, U. Penn., Clemson, GSU, Texas A&M, UCSD, and more

● Passionate about making things faster

ArrayFire: R&D

● Case studies on image analysis for massive datasets

● Large scale triangle counting on multiple GPUs

● Virtual fitting of glasses

● Computer aided logo detection

● Sleep disorder diagnosis

ArrayFire: The Library

● The Library
○ GPU-accelerated, high-performance library
○ Extensive set of built-in functions
○ Applicable to numerous domains
■ Science
■ Engineering
■ Finance
■ Biology
○ Great for mobile and embedded computing

Libraries are Great

Eliminate Hidden Costs

Library Types

● Specialized GPU library
○ Targeted at a specific set of operators (functionality)
■ Precision tools
○ Optimized for specific systems
○ -like interface
○ Raw pointer interface

Images taken from: http://classroom.synonym.com/learn-surgical-instruments-2429.html

Library Types (cont.)

● General GPU library ○ Manage GPU resources using containers

○ Applicable to a large set of applications and domains

○ Portable across multiple architectures

○ Higher level functions

○ Can make use of specialized GPU libraries

Images taken from: http://wordlesstech.com/2010/12/09/swiss-army-knife-giant-elite/

ArrayFire Capabilities

● Hundreds of parallel functions for multi-disciplinary work
○ Image processing
○ Machine learning
○ Graphics
○ Sets
● Support for multiple languages
○ C/C++, Fortran, Java and R
● Multiple backends - great for portability
○ CUDA
○ OpenCL
○ CPU (requires gcc)

ArrayFire Capabilities (cont.)

● Linux, Windows, Mac OS X

● OpenGL based graphics

● JIT
○ Combine multiple operations into one kernel

ArrayFire Functions

● Hundreds of highly-optimized parallel functions
○ Signal/image processing
■ Convolution
■ FFT
■ Histograms
■ Interpolation
■ Connected components
○ Linear algebra
■ Matrix multiply
■ Linear system solving
■ Factorization

ArrayFire Functions

● Supports hundreds of parallel functions
○ Building blocks
■ Reductions
■ Scan
■ Set operations
■ Sorting
■ Statistics
■ Basic matrix manipulation

Images taken from:
http://technogems.blogspot.com/2011/06/sorting-included-files-by-importance.html
http://www.cmsoft.com.br/tutorialOpenCL/CLMatrixMultExplanationSubMatrixes.png

ArrayFire: Data

● Built around a flexible data structure named "array"
○ Lightweight wrapper around the data on the compute device
○ Manages the data and basic metadata such as size, type and dimensions
● You can transfer data into an array using constructors
● Data is stored in column-major order

float hA[6] = {0, 1, 2, 3, 4, 5};
array A(2, 3, hA);   // 2 rows x 3 columns; columns are {0,1}, {2,3}, {4,5}

Introducing...

Open source! https://www.github.com/arrayfire/arrayfire

ArrayFire: Indexing
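As background for the column-major layout and the column selection `a(span, 1)` used in the example that follows, here is a minimal host-side sketch in plain C++. `ColMajorView` and `column_sum` are hypothetical helpers written for this illustration; they are not part of ArrayFire and only mimic how a 2 x 4 array would interpret a flat buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an ArrayFire type): a read-only 2-D view over
// a flat buffer that is interpreted in column-major order.
struct ColMajorView {
    std::size_t rows;
    std::size_t cols;
    const float* data;

    // Element (r, c) lives at offset r + c * rows in column-major storage.
    float at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r + c * rows];
    }
};

// Host-side analogue of sum(a(span, c)): add up every row of column c.
float column_sum(const ColMajorView& v, std::size_t c) {
    float total = 0.0f;
    for (std::size_t r = 0; r < v.rows; ++r) total += v.at(r, c);
    return total;
}
```

With the buffer `{1, 2, 4, 8, 16, 32, 64, 128}` viewed as 2 x 4, column 1 holds `{4, 8}`, so its sum is 12, which matches the value noted in the slide's comment.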

#include <arrayfire.h>
using namespace af;

void af_example() {
    float f[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    array a(2, 4, f);                      // 2 rows x 4 cols, initialized with f (column major)
    array sumSecondCol = sum(a(span, 1));  // reduce-sum over the second column
    print(sumSecondCol);                   // 12
}

ArrayFire Example: Swap R and B

array tmp = img(span,span,0);          // save the R channel
img(span,span,0) = img(span,span,2);   // R channel gets values of B
img(span,span,2) = tmp;                // B channel gets values of R

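// Alternative: build a new image by joining the channels in B, G, R order
array swapped = join(2, img(span,span,2),   // blue
                        img(span,span,1),   // green
                        img(span,span,0));  // red

// Alternative: index the channel dimension with a reversed sequence
array swapped = img(span,span,seq(2,-1,0));

ArrayFire Functions

[Figure panels: Original, Grayscale, Box Filter Blur, Gaussian Blur, Image Negative]

Erosion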
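The two masks in the ArrayFire snippet for this slide differ only in which neighbors take part in the minimum. The sketch below shows that idea on the host in plain C++. `erode3x3` is a hypothetical function written for this illustration, it stores images row-major for readability, and it skips out-of-bounds neighbors, which is an assumption about border handling rather than a statement of what `erode` does.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Grayscale erosion with a 3x3 structuring element (host-side sketch).
// Assumes the mask includes its center, as both masks on the slide do.
// Out-of-bounds neighbors are skipped (assumed border policy).
std::vector<float> erode3x3(const std::vector<float>& img, int rows, int cols,
                            const float (&mask)[9]) {
    assert(static_cast<int>(img.size()) == rows * cols);
    std::vector<float> out(img.size());
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float m = img[r * cols + c];
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    // Only neighbors selected by the mask contribute.
                    if (mask[(dr + 1) * 3 + (dc + 1)] == 0.0f) continue;
                    const int rr = r + dr, cc = c + dc;
                    if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) continue;
                    m = std::min(m, img[rr * cols + cc]);
                }
            }
            out[r * cols + c] = m;
        }
    }
    return out;
}
```

For a 3 x 3 image that is 1 everywhere except a 0 in the top-left corner, the all-ones mask (8-neighbor) erodes the center pixel to 0, while the cross-shaped mask (4-neighbor) leaves it at 1 because the diagonal neighbor is not consulted.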

ArrayFire:

// erode an image, 8-neighbor connectivity
array mask8 = constant(1, 3, 3);
array img_out = erode(img_in, mask8);

// erode an image, 4-neighbor connectivity
const float h_mask4[] = { 0.0, 1.0, 0.0,
                          1.0, 1.0, 1.0,
                          0.0, 1.0, 0.0 };
array mask4 = array(3, 3, h_mask4);
array img_out = erode(img_in, mask4);

[Figure: Erosion]

Filtering
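The filtering calls on this slide include both a full 2-D convolution and a separable one. As a rough host-side illustration of why the separable form gives the same result when the 2-D kernel is the outer product of a column filter and a row filter, consider the plain C++ sketch below. `Mat2`, `convolve2d` and `convolve_separable` are hypothetical names for this example; zero padding and a same-size output are assumptions, not a description of ArrayFire's behavior.

```cpp
#include <cassert>
#include <vector>

// Row-major 2-D container for this sketch (hypothetical type).
struct Mat2 {
    int rows, cols;
    std::vector<float> v;
    // Zero padding: reads outside the matrix return 0.
    float at(int r, int c) const {
        if (r < 0 || r >= rows || c < 0 || c >= cols) return 0.0f;
        return v[r * cols + c];
    }
};

// Same-size 2-D convolution with an odd-sized kernel centered on each pixel.
Mat2 convolve2d(const Mat2& img, const Mat2& ker) {
    assert(ker.rows % 2 == 1 && ker.cols % 2 == 1);
    Mat2 out{img.rows, img.cols, std::vector<float>(img.v.size(), 0.0f)};
    const int cr = ker.rows / 2, cc = ker.cols / 2;
    for (int r = 0; r < img.rows; ++r) {
        for (int c = 0; c < img.cols; ++c) {
            float acc = 0.0f;
            for (int i = 0; i < ker.rows; ++i)
                for (int j = 0; j < ker.cols; ++j)
                    // Convolution indexes the image in the flipped direction.
                    acc += ker.v[i * ker.cols + j] * img.at(r + cr - i, c + cc - j);
            out.v[r * img.cols + c] = acc;
        }
    }
    return out;
}

// Separable version: filter along columns with fcol, then along rows with frow.
Mat2 convolve_separable(const Mat2& img, const std::vector<float>& fcol,
                        const std::vector<float>& frow) {
    Mat2 kc{static_cast<int>(fcol.size()), 1, fcol};
    Mat2 kr{1, static_cast<int>(frow.size()), frow};
    return convolve2d(convolve2d(img, kc), kr);
}
```

The separable path does work proportional to the sum of the filter lengths per pixel instead of their product, which is the usual reason to prefer it when the kernel factorizes.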

ArrayFire:

array R = convolve(img, ker);          // 1, 2 and 3d convolution filter
array R = convolve(fcol, frow, img);   // Separable convolution
array R = filter(img, ker);            // 2d correlation filter

Histograms
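For readers who want to see what a histogram call computes, here is a minimal host-side sketch in plain C++. `histogram_host` is a hypothetical function, and the explicit `minval`/`maxval` range is an assumption of this example; the slide's `histogram(img, nbins)` call does not show how the range is chosen.

```cpp
#include <cassert>
#include <vector>

// Count values into nbins equal-width bins covering [minval, maxval].
// Values outside the range are ignored (a choice made for this sketch).
std::vector<unsigned> histogram_host(const std::vector<float>& img, int nbins,
                                     float minval, float maxval) {
    assert(nbins > 0 && maxval > minval);
    std::vector<unsigned> hist(nbins, 0u);
    const float width = (maxval - minval) / nbins;
    for (float p : img) {
        if (p < minval || p > maxval) continue;
        int bin = static_cast<int>((p - minval) / width);
        if (bin >= nbins) bin = nbins - 1;  // p == maxval goes in the last bin
        ++hist[bin];
    }
    return hist;
}
```

With 256 bins over the range 0 to 256, each bin has width 1, so an 8-bit pixel value maps directly to the bin with the same index.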

ArrayFire:

int nbins = 256;
array hist = histogram(img, nbins);

Transforms

ArrayFire:

array half = resize(0.5, img);
array rot90 = rotate(img, af::Pi/2);
array warped = approx2(img, xLocations, yLocations);

Image smoothing
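The Gaussian blur at the end of this slide convolves the image with a Gaussian kernel. The plain C++ sketch below shows one common way to build such a kernel on the host. `gaussian_kernel_host` is a hypothetical function, and taking `sigma` as an explicit parameter is an assumption of this example; the slides do not say how `gaussiankernel` chooses its width.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Build a normalized rows x cols Gaussian kernel, stored row-major.
std::vector<double> gaussian_kernel_host(int rows, int cols, double sigma) {
    assert(rows > 0 && cols > 0 && sigma > 0.0);
    std::vector<double> k(rows * cols);
    const double cr = (rows - 1) / 2.0;  // kernel center (row)
    const double cc = (cols - 1) / 2.0;  // kernel center (column)
    double total = 0.0;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            const double dr = r - cr, dc = c - cc;
            k[r * cols + c] = std::exp(-(dr * dr + dc * dc) / (2.0 * sigma * sigma));
            total += k[r * cols + c];
        }
    }
    // Normalize so the weights sum to 1 and overall brightness is preserved.
    for (double& x : k) x /= total;
    return k;
}
```

The weights fall off with distance from the center, so the center weight is the largest and the corner weights are the smallest.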

ArrayFire:

array S = bilateral(I, sigma_r, sigma_c);
array M = meanshift(I, sigma_r, sigma_c, iter);
array R = medfilt(img, 3, 3);

// Gaussian blur
array gker = gaussiankernel(ncols, ncols);
array res = convolve(img, gker);

FFT
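The last line of the FFT snippet on this slide convolves by multiplying two padded transforms. The same idea can be checked in one dimension with a naive DFT in plain C++. `dft` and `fft_convolve` below are hypothetical helpers for this illustration; a quadratic-time DFT stands in for a real FFT, and the 1-D setting is a simplification of the slide's `fft2` usage.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive O(n^2) DFT; sign = -1 for forward, +1 for inverse (unscaled).
cvec dft(const cvec& x, int sign) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += x[t] * std::polar(1.0, sign * 2.0 * pi * double(k * t) / double(n));
    return out;
}

// Linear convolution via the frequency domain: zero-pad both inputs to
// length a.size() + b.size() - 1, multiply the spectra, transform back.
std::vector<double> fft_convolve(const std::vector<double>& a,
                                 const std::vector<double>& b) {
    assert(!a.empty() && !b.empty());
    const std::size_t n = a.size() + b.size() - 1;
    cvec pa(n), pb(n);
    for (std::size_t i = 0; i < a.size(); ++i) pa[i] = a[i];
    for (std::size_t i = 0; i < b.size(); ++i) pb[i] = b[i];
    cvec fa = dft(pa, -1);
    const cvec fb = dft(pb, -1);
    for (std::size_t i = 0; i < n; ++i) fa[i] *= fb[i];
    const cvec r = dft(fa, +1);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = r[i].real() / double(n);
    return out;
}
```

The padding plays the role of the `M, N` arguments on the slide: without it, the product of spectra would give a circular convolution whose tail wraps around onto the start.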

FFT (API):

array R1 = fft2(I);        // 2D FFT; see also fft and fft3
array R2 = fft2(I, M, N);  // fft2 with padding
array R3 = ifft2(fft2(I, M, N) * fft2(K, M, N));  // convolve using fft2

JIT Code Generation

● Runtime kernel generation

● Combines multiple element-wise operations into one kernel

● Reduces number of kernel launches

● Improves cache performance
○ Intermediate data not allocated

JIT: Monte Carlo Pi Simulation

array temp = (sqrt(x*x+y*y)<1);
return sum(temp);

No JIT:
● 5 function calls (2 "*", 1 "+", 1 "<", 1 "√")

With JIT:
● One function call
● Fewer cache misses
○ “Capacity” cache misses avoided

JIT Performance
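The gain comes from fusing the element-wise steps of the Monte Carlo Pi expression into a single pass. The plain C++ sketch below is only a CPU analogy of that idea, not ArrayFire's JIT: the unfused version writes a full temporary for each of the five element-wise operations, while the fused version evaluates the whole expression per element. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: every element-wise step makes its own pass and its own temporary,
// much like launching one kernel per operation.
std::size_t count_inside_unfused(const std::vector<float>& x,
                                 const std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<float> xx(n), yy(n), s(n), r(n);
    std::vector<char> inside(n);
    for (std::size_t i = 0; i < n; ++i) xx[i] = x[i] * x[i];        // "*"
    for (std::size_t i = 0; i < n; ++i) yy[i] = y[i] * y[i];        // "*"
    for (std::size_t i = 0; i < n; ++i) s[i] = xx[i] + yy[i];       // "+"
    for (std::size_t i = 0; i < n; ++i) r[i] = std::sqrt(s[i]);     // sqrt
    for (std::size_t i = 0; i < n; ++i) inside[i] = (r[i] < 1.0f);  // "<"
    std::size_t count = 0;
    for (char b : inside) count += b;
    return count;
}

// Fused: one pass over the inputs and no intermediate arrays.
std::size_t count_inside_fused(const std::vector<float>& x,
                               const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        count += (std::sqrt(x[i] * x[i] + y[i] * y[i]) < 1.0f);
    return count;
}
```

With points drawn uniformly from the unit square, `4.0 * count / n` estimates pi in either version; the fused form simply touches memory once instead of once per operation.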

Results are from a K20 GPU

Lower is better

Additional Performance Results

OpenCV

● Open source computer vision library

● C++ interface

● Some CUDA-supported functionality

OpenCV: ArrayFire Interoperability

● Helper Functions ○ https://github.com/arrayfire-community/arrayfire_opencv.git
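One thing any such conversion has to deal with is memory layout: OpenCV's `Mat` stores elements row by row, while ArrayFire arrays are column major. The sketch below shows the reordering in plain C++. `row_major_to_col_major` is a hypothetical function for illustration only; whether the helper library performs the conversion this way is an assumption, not something the slides state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorder a rows x cols matrix from row-major storage to column-major storage.
std::vector<float> row_major_to_col_major(const std::vector<float>& src,
                                          std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            // Row-major offset is r * cols + c; column-major is r + c * rows.
            dst[r + c * rows] = src[r * cols + c];
    return dst;
}
```

For the 2 x 3 matrix with rows `{1, 2, 3}` and `{4, 5, 6}`, the row-major buffer `{1, 2, 3, 4, 5, 6}` becomes `{1, 4, 2, 5, 3, 6}` in column-major order.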

Mat R;                                    // OpenCV API
Rodrigues(poses(Rect(0, 0, 1, 3)), R);    // OpenCV function call
af::array af_R = mat_to_array(R);         // Results transferred to array data structure

Feature Tracking: FAST Corner Detection

● In comparison with OpenCV CPU implementation
● Larger data sets can be analyzed
● Single-threaded CPU implementation

Feature Tracking: Harris

● Feature detection
● Better performance
● Larger data sets can be analyzed
● Single-threaded CPU implementation

Feature Tracking: ORB Speedup

● Better performance
● Uses the FAST algorithm
○ At multiple scales
● Single-threaded CPU implementation

Feature Tracking: Example (ORB)

Speedup for sample: 21.6x

Performance Results: Bilateral Filtering

Performance Results: Convolution

Performance Results: Image Rotate

Performance Results: Bilinear Image Resize

Performance insights

● Large images - GPUs are ideal.

● Smaller images - GPUs are great, but better performance can be achieved through batching and data parallelism.
○ With batching or data parallelism, GPUs are also ideal.

Conway’s Game of Life
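For reference, the update rule that the ArrayFire snippet on this slide expresses with a convolution and two comparisons can be written as a plain C++ loop. `life_step` is a hypothetical host-side function; it assumes the neighborhood kernel counts the eight surrounding cells and excludes the center, and that cells outside the grid count as dead. Neither detail is shown on the slide.

```cpp
#include <cassert>
#include <vector>

// One generation on a rows x cols grid (row-major, 1 = alive, 0 = dead).
std::vector<int> life_step(const std::vector<int>& state, int rows, int cols) {
    assert(static_cast<int>(state.size()) == rows * cols);
    std::vector<int> next(state.size(), 0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            // Count live neighbors; this is what the convolution produces.
            int n_hood = 0;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    if (dr == 0 && dc == 0) continue;
                    const int rr = r + dr, cc = c + dc;
                    if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) continue;
                    n_hood += state[rr * cols + cc];
                }
            }
            const int c0 = (n_hood == 2);  // a live cell survives with 2 neighbors
            const int c1 = (n_hood == 3);  // any cell is alive with 3 neighbors
            next[r * cols + c] = state[r * cols + c] * c0 + c1;
        }
    }
    return next;
}
```

A horizontal three-cell "blinker" in a 5 x 5 grid turns vertical after one step and returns to its original shape after two, which is a quick sanity check for the rule.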

// Convolve gets neighbors
af::array nHood = convolve(state, kernel, false);

// Generate conditions for life
af::array C0 = (nHood == 2);
af::array C1 = (nHood == 3);

// Update state
state = state * C0 + C1;

Open Source Roadmap

● Create capabilities for
○ Streaming video
○ Large number of images
○ Machine learning
○ Data analysis
○ Dynamic data

● Faster visualization/rendering utilities for large scale data sets

Open Source Roadmap

● Support for sparse linear algebra

● Additional language wrappers

● More machine learning and computer vision functions

● Additional “big-data” infrastructure

Look Us Up

Company website: www.arrayfire.com

Open source library: https://github.com/arrayfire/arrayfire

Language wrappers {Fortran, R, Java}: https://github.com/arrayfire/arrayfire_$lang

Q & A

Speaker: Oded Green ([email protected])

Engineer: Shehzan Mohammed ([email protected])

Sales: Scott Blakeslee ([email protected])