S3012 - Simplifying Portable Killer Apps with OpenACC and CUDA-5, Concisely and Efficiently

Rob Farber
Chief Scientist, BlackDog Endeavors, LLC
Author, "CUDA Application Design and Development"
Research consultant: ICHEC, Fortune 100 companies, and others
Scientist

• Dr. Dobb's Journal CUDA & OpenACC tutorials
• OpenCL tutorials on The Code Project
• Columnist for Scientific Computing, and other venues

The three pillars of science
• Scientists spend most of their time working on computers
• The last five years have revolutionized computing
• Let's briefly look at how this has happened

From games to supercomputers
• GPUs evolved from pushing pixels; CPUs evolved from running applications
• The failure of Dennard scaling caused the switch to multicore
• Farber, "Intel's 50+ core MIC architecture: HPC on a Card or Massive Co-Processor?"

[Image: Larrabee (zZz) evolved into the Xeon Phi (MIC)]

Supercomputing for the masses!
[Image: CUDA (100 million+ GPUs) and OpenCL (1/3 billion+? GPUs)]
• Market forces evolved GPUs into massively parallel GPGPUs (General Purpose GPUs)
• 400+ million CUDA-enabled GPUs says it all!
• CUDA put supercomputing in the hands of the masses
– December 1996: ASCI Red, the first teraflop supercomputer
– Today: kids buy GPUs with flop rates comparable to the systems available to scientists with supercomputer access in the mid-to-late 1990s
• GTX 560: $60 USD on eBay

Remember that Finnish kid who wrote some software to understand operating systems?

Inexpensive commodity hardware enables:
• New thinking
• A large educated base of developers

You can change the world!

GPUs enable killer apps!
• Orders of magnitude faster apps / low-power apps:
– 10x can make computational workflows more interactive (even poorly performing GPU apps are useful)
– 100x is disruptive and has the potential to fundamentally affect scientific research by removing time-to-discovery barriers
– 1000x and greater, achieved through the use of the NVIDIA SFUs (Special Function Units) or multiple GPUs … Whooo Hoooo!

Two big ideas:
1. SIMD
2. A strong scaling execution model

Big hardware idea 1: SIMD
[Images: Kepler K20; the Connection Machine]
High performance from the past:
• Space and power efficient
• Long life via a simple model

Works great on multi-core MPI systems!
[Chart: Observed peak effective rate (TF/s) vs. number of Ranger (Barcelona) cores, 0 to 70,000]
Farber's general SIMD mapping:
• 60,000 cores: 363 TF/s measured
• 62,796 cores: 386 TF/s (projected)
"Most efficient implementation to date" (Singer 1990), (Thearling 1995)

Results presented at SC09 (courtesy TACC)

Scalability is required to use all those cores (the strong scaling execution model)
• Threads can only communicate within a thread block
– (yes, there are atomic ops)
• Fast hardware scheduling
– Both at the Grid level and on the SM/SMX
A block-local reduction sketch follows.
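As a minimal sketch of what "communicate within a thread block" looks like in CUDA (the kernel name is illustrative, and a 256-thread, power-of-two block size is assumed):

__global__ void blockSum(const float *in, float *out, int n)
{
   __shared__ float s[256];                  // one slot per thread in the block
   int tid = threadIdx.x;
   int i = blockIdx.x * blockDim.x + tid;
   s[tid] = (i < n) ? in[i] : 0.f;
   __syncthreads();
   // Threads in this block cooperate through shared memory.
   for (int stride = blockDim.x/2; stride > 0; stride >>= 1) {
      if (tid < stride) s[tid] += s[tid + stride];
      __syncthreads();
   }
   if (tid == 0) atomicAdd(out, s[0]);       // atomics handle the rare cross-block step
}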

You can see the need for scalability:
• Intel Sandy Bridge Core i7 3960X (6 cores)
• AMD 7970 (1,280 work-items)
• NVIDIA Fermi (1,536 CUDA cores)
• NVIDIA Kepler K20 (2,880 CUDA cores)
• Knights Corner (61 cores / 244 threads)

Assert that a strong scaling execution model is required to run on future massively parallel devices.

Big idea 2: A strong scaling execution model!
• Four basic types of programming models:
– Language platforms based on a strong-scaling execution model (CUDA and OpenCL™)
– Directive-based programming like OpenMP and OpenACC
• Note: OpenACC can utilize a strong scaling execution model
– Common libraries providing FFT and BLAS functionality
– MPI (Message Passing Interface)
• Perfect strong scaling decreases runtime linearly with the number of processing elements

MIC differs from GPUs
• 64 cores on the die
• Somewhere between 50 and 64 cores activated, depending on yields and clock speeds
• Expect 50 to 64 cores running at 1.2 GHz to 1.6 GHz (source: The Register)

• Uses a per-core vector unit for high flop rates
• Assumed 8 GB per PCIe card
Image source (one comment removed): http://www.hpcwire.com/hpcwire/2012-04-03/nvidia_pokes_holes_in_intel_s_manycore_story.html

Flexible, but strong scalability is not guaranteed ("Program with lots of threads that use vectors")
• 61 cores, each with a wide per-core vector unit
• Floating-point performance comes from the per-core vector unit

[Illustration: cores, each with a 512-wide vector unit (vs. narrower SSE), connected by a ring interconnect]

• Assume the performance of 61 Pentium cores at 1.2-1.6 GHz when the wide vector unit is not used
• Similarly, assume the performance of a single 1.2-1.6 GHz Pentium core on sequential portions of code (Amdahl's Law)

Four general programming models
1. Language platforms based on a strong-scaling execution model (CUDA and OpenCL™)
2. Directive-based programming like OpenMP and OpenACC
• Note: OpenACC can utilize a strong scaling execution model
3. Common libraries providing FFT and BLAS functionality
4. MPI (Message Passing Interface)

OpenACC C language programming

/* matrix-omp.c */
int main()
{
   ...
   // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(a,b,c) private(i,j,k)
   for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
         for (k = 0; k < SIZE; ++k) {
            c[i][j] += a[i][k] * b[k][j];
         }
      }
   }
   return 0;
}

/* matrix-acc.c */
int main()
{
   ...
   // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
   for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
         for (k = 0; k < SIZE; ++k) {
            c[i][j] += a[i][k] * b[k][j];
         }
      }
   }
   return 0;
}
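As a usage sketch (assuming the PGI compiler; other OpenACC compilers use different flags), the two versions above might be built with:

pgcc -mp matrix-omp.c -o matrix-omp
pgcc -acc -Minfo=accel matrix-acc.c -o matrix-acc

The -Minfo=accel report describes how the compiler scheduled the loop nest on the accelerator.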

Farber, “Pragmatic Parallelism Part 1: Introducing OpenACC”

OpenACC Fortran anyone?

! matrix-acc.f
program example1
  ...
!$acc data copyin(a,b) copy(c)
!$acc kernels loop
  ! Compute matrix multiplication.
  do i=1, n_size
    do j=1, n_size
      do k = 1, n_size
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      enddo
    enddo
  enddo
!$acc end data
end program example1

For comparison, the OpenMP C version (matrix-omp.c) is shown above.

Farber, “Pragmatic Parallelism Part 1: Introducing OpenACC”

OpenACC adds the concept of device memory
Example NVIDIA Visual Profiler (nvvp) timeline from "Introducing OpenACC":
• Move matrices a, b, and c to the coprocessor (GPU)
• Perform the matrix multiply (line 24 in main)
• Move matrix c back to the host

Farber, “Pragmatic Parallelism Part 1: Introducing OpenACC”

Three rules for fast GPU/co-processor codes
1. Get the data on the device (and keep it there!)
• PCIe x16 v2.0 bus: 8 GiB/s in a single direction
• 20-series GPUs: 140-200 GiB/s
2. Give the device enough work to do
• Assume 2 μs latency and a 1 TF/s device
• Can waste (2 × 10⁻⁶ s × 10¹² flop/s) = 2M operations
3. Reuse and locate data to avoid global memory bandwidth bottlenecks
• 10³ Gflop/s hardware can deliver 10 Gflop/s when global memory limited
• Causes a 100x slowdown!
Corollary: avoid malloc/free!
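A back-of-the-envelope illustration of rule 1, using the numbers above (1 GiB of data is an assumed workload):

$$T_{\text{transfer}} = \frac{1\ \text{GiB}}{8\ \text{GiB/s}} = 125\ \text{ms}, \qquad 125\ \text{ms} \times 10^{12}\ \text{flop/s} = 1.25 \times 10^{11}\ \text{flops forgone per transfer}$$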

Research: TLP can help with nested parallelism!
Square matrix multiply: which loops are faster, or is there a difference?

Rearranged loops:

for (int i = 0; i < size; ++i)
   for (int k = 0; k < size; ++k)
      for (int j = 0; j < size; ++j)
         C[i][j] += A[i][k] * B[k][j];

Conventional loops:

for (int i = 0; i < size; ++i)
   for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k)
         tmp += A[i][k] * B[k][j];
      C[i][j] = tmp;
   }

Runtimes (lower is better):

Rearranged loops
Run       OpenACC    OpenMP     OpenACC speedup
1         0.04298    0.12139    2.82
2         0.041681   0.13461    3.23
3         0.041697   0.13055    3.13
Average                         3.06

Conventional loops
Run       OpenACC    OpenMP     OpenACC speedup
1         0.045108   2.9749     65.95
2         0.043823   2.6862     61.30
3         0.043793   2.6802     61.20
Average                         62.82

Dynamic nested parallelism is even worse!
http://www.drdobbs.com/parallel/creating-and-using-libraries-with-openac/240012502

OpenACC "Hello World" to exascale

int main()
{
   cout << "Hello World" << endl;
   // load data and initialize parameters
   init();
#pragma acc data \
   copyin(param[0:N_PARAM-1]) \
   pcopyin(example[0:nExamples*EXAMPLE_SIZE-1])
   {
      optimize( objFunc ); // the optimizer calls the objective function
   }
   return 0;
}

double objFunc( ... )
{
   double err=0.;
#pragma acc parallel loop reduction(+:err)
#pragma omp parallel for reduction(+ : err)
   for(int i=0; i < nExamples; i++) {
      // transform
      float d = myFunc(i, param, example, nExamples, NULL);
      // reduce
      err += d*d;
   }
   return sqrt(err);
}
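For concreteness, a hypothetical myFunc — the per-example transform the listing leaves abstract — might compute the residual of a linear model (this body is an assumption; EXAMPLE_SIZE is taken to be 2, with each example holding an (x, y) pair):

inline float myFunc(int i, const float *param, const float *example,
                    int nExamples, void *unused)
{
   // Hypothetical transform: residual of y ≈ param[0]*x + param[1]
   float x = example[i*EXAMPLE_SIZE];
   float y = example[i*EXAMPLE_SIZE + 1];
   return param[0]*x + param[1] - y; // objFunc squares and sums these residuals
}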

[Diagram: DATA feeding the objective function — Exascale capable!]

Optimize an "objective function"
Applicable to a general class of optimization problems:
– Locally Weighted Linear Regression (LWLR)
– Neural Networks
– Naive Bayes (NB)
– Gaussian Discriminative Analysis (GDA)
– k-means
– Logistic Regression (LR)
– Independent Component Analysis (ICA)
– Expectation Maximization (EM)
– Support Vector Machine (SVM)
– Others (MDS, Ordinal MDS, etcetera)

A general mapping: energy = objFunc(p1, p2, …, pn)

Optimization method (Powell, Conjugate Gradient, other) running on the host:
• Step 1: Broadcast parameters
• Step 2: Calculate partials
• Step 3: Sum partials to get energy

[Diagram: the host broadcasts p1, p2, …, pn to GPUs 1-4; each GPU holds its own slice of the examples (0 to N-1, N to 2N-1, 2N to 3N-1, 3N to 4N-1)]
An MPI sketch of this pattern follows.
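A minimal sketch of the broadcast/partial/reduce pattern with MPI (the helper objFuncLocal, which would evaluate a rank's slice of examples on its GPU, is a hypothetical stand-in):

#include <mpi.h>
#include <math.h>

extern double objFuncLocal(double *param, int nParam); // hypothetical per-GPU partial sum

double distributedObjFunc(double *param, int nParam)
{
   double partial, total = 0.;
   // Step 1: broadcast the current parameters to every rank
   MPI_Bcast(param, nParam, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   // Step 2: each rank computes a partial sum of squared errors on its data slice
   partial = objFuncLocal(param, nParam);
   // Step 3: sum the partials to get the energy
   MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
   return sqrt(total); // meaningful on rank 0
}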

See a path to exascale (MPI can map to hundreds of GPUs)
• Over 350 TF/s of performance on Longhorn (including communications!)
• Dominant runtime of code that scales to 500 GPUs
• 600+ GF/s per K20
• Expect tens of petaflop/s average performance from Titan

Always report "Honest Flops":

$$\mathit{EffectiveRate} = \frac{\mathit{TotalOpCount}}{T_{broadcast} + T_{objfunc} + T_{reduce}}$$

Important design concept #1
• Make your life easy
– Use the highest-level interface first
• Delve down into lower-level programming when:
– You need higher performance
– The high-level API does not do what you want
» It is necessary to use a lower-level capability
» Make use of some hardware feature

"Computational Universality": an XOR neural network
[Diagram: network with activation G(x)]
• The example of XOR nicely emphasizes the importance of hidden neurons:

• They re-represent the input such that the problem becomes linearly separable
• Networks with hidden units can implement any Boolean function -> computationally universal devices!
• Networks without hidden units cannot learn XOR, and cannot represent large classes of problems
A minimal XOR network sketch follows.
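As a minimal sketch (the weights and thresholds are chosen by hand here, not learned), a 2-2-1 network with step activations computes XOR:

#include <stdio.h>

// Step activation: fires when the weighted sum exceeds zero.
static int step(double x) { return x > 0.0 ? 1 : 0; }

int main(void)
{
   for (int a = 0; a <= 1; ++a)
      for (int b = 0; b <= 1; ++b) {
         int h1 = step(a + b - 0.5);    // hidden unit 1: OR
         int h2 = step(a + b - 1.5);    // hidden unit 2: AND
         int y  = step(h1 - h2 - 0.5);  // output: OR AND NOT(AND) == XOR
         printf("%d XOR %d = %d\n", a, b, y);
      }
   return 0;
}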

NetTalk
[Image: NetTalk after 500 learning loops; finished network]
Sejnowski, T. J. and Rosenberg, C. R. (1986). NETtalk: a parallel network that learns to read aloud. Cognitive Science, 14, 179-211.
http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)

Application to bioinformatics
A.S. Lapedes, C. Barnes, C. Burks, R.M. Farber, K. Sirotkin, "Applications of Neural Net and Other Machine Learning Algorithms to DNA Sequence Analysis", Computers and DNA, SFI Studies in the Sciences of Complexity, vol. VII, Eds. G. Bell and T. Marr, Addison-Wesley (1989).

[Diagram: NetTalk maps a window of text ("t e X t") through internal connections to the phoneme to be pronounced; the bioinformatics network maps an A/C/G/T window through internal connections to a T|F exon-region output]

Predicting binding affinity (the closer you look, the greater the complexity)

[Electron microscope image]

The question for computational biology
• How do we know you are not playing expensive computer games with our money?
• Utilize a blind test

Binding affinity for a specific antibody
[Diagram: network with one input per hexamer position (A0-A5) and internal connections predicting affinity]
• Possible hexamers: 20⁶ = 64M
• Training data: 1k-2k pseudo-random (hexamer, binding affinity) pairs

"Learning Affinity Landscapes: Prediction of Novel Peptides", Alan Lapedes and Robert Farber, Los Alamos National Laboratory Technical Report LA-UR-94-4391 (1994).

Hill climbing to find high affinity (approx. 0.001% sampling), then confirm experimentally

Learn: $\mathit{Affinity}_{Antibody} = f(A_0, \ldots, A_5)$

[Diagram: network with inputs A0-A5 and internal connections; hill climbing over predicted affinities, e.g. f(F,F,F,F,F,F), f(F,F,F,F,F,V), f(F,F,F,F,F,L), f(F,F,F,F,L,L), …]

Predict that P,C,T,N,S,L has the highest binding affinity: f(P,C,T,N,S,L). A greedy hill-climbing sketch follows.
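A minimal greedy hill-climbing sketch over hexamer space (predictAffinity, standing in for the trained network, is hypothetical, and higher scores are assumed to mean tighter binding):

#define HEX_LEN 6
extern double predictAffinity(const char peptide[HEX_LEN]); // hypothetical trained model

void hillClimb(char peptide[HEX_LEN])
{
   const char *AA = "ACDEFGHIKLMNPQRSTVWY"; // the 20 amino acids
   double best = predictAffinity(peptide);
   int improved = 1;
   while (improved) {
      improved = 0;
      for (int pos = 0; pos < HEX_LEN; ++pos) {   // try every single-residue mutation
         char keep = peptide[pos];
         for (int a = 0; a < 20; ++a) {
            peptide[pos] = AA[a];
            double score = predictAffinity(peptide);
            if (score > best) { best = score; keep = peptide[pos]; improved = 1; }
         }
         peptide[pos] = keep;                     // retain the best residue at this position
      }
   }
}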

Two important points
• The computer appears to correctly predict experimental data
• Demonstrated that complex binding affinity relationships can be learned from a small set of samples
– Necessary because it is only possible to sample a very small subset of the binding affinity landscape for drug candidates

1995 drug design hardware vs. 2013 (analyzed all available chemical data … TB of data)
• Quad-core, 512 MB Sun workstation
– My Samsung S3 is more powerful and has 2 GB RAM
• 80 GB disk and a TB DLT tape stacker
– A TB laptop hard drive
• 60 Gflop/s Connection Machine
– A mobile GeForce GPU

You can change the world! $30M of hardware replaced by a GPU-accelerated laptop.

Example: PCA (Principal Components Analysis)
• Widely used in data mining and data reduction
– Discuss a method proposed by Sanger (1989); the update rule is stated below
• Extends to Nonlinear PCA (NLPCA)
– Discuss a method by E. Oja, J. Karhunen, L. Wang, and R. Vigario (1995)
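For reference, Sanger's rule (the generalized Hebbian algorithm), with $\eta$ the learning rate, $y_i$ the $i$-th output, and $x_j$ the $j$-th input:

$$\Delta w_{ij} = \eta \, y_i \left( x_j - \sum_{k \le i} y_k \, w_{kj} \right)$$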

[Diagram: autoencoder network with input units (I), bottleneck units (B), and output units (O)]
• The general mapping scales according to data
• Exascale capable!
• Provides the ability to compare linear and nonlinear performance

Intel Xeon Phi runs
• Great from a code portability point of view
• Watch out for jitter! Read: "The Case of the Missing Supercomputer Performance"

[Chart: Gflop/s (0 to 140) vs. number of OpenMP threads (1 to 469), showing runtime variations — jitter; the green dot is the average]

PCA (Principal Components Analysis), 2x10x1x10x2 autoencoder
[Chart: GF/s (0 to 1200) vs. number of data examples in millions (0 to 35); series: PCA Xeon Phi native, PCA Xeon Phi offload, PCA K20c, PCA host (24-core Westmere)]
Looking forward to trying a K20X with a current PCIe chipset

NLPCA (Nonlinear PCA), 2x10x1x10x2 autoencoder
Yes, performance did increase slightly

[Chart: GF/s (0 to 800) vs. number of data examples in millions (0 to 35); series: NLPCA K20c, NLPCA Xeon Phi native, NLPCA Xeon Phi offload, NLPCA host (24-core Westmere)]

Love those SFUs! (Special Function Units)
• Fast transcendental functions
– The world is nonlinear … so are many computational models
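A small CUDA sketch of the idea (the kernel is illustrative; __expf and __sinf are the fast, reduced-precision intrinsics that exercise the SFUs):

__global__ void fastNonlinear(const float *x, float *y, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n)
      y[i] = __expf(-x[i]) * __sinf(x[i]); // fast SFU path
      // vs. expf(-x[i]) * sinf(x[i]);     // slower, full-precision path
}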

TF/s devices open the door to new topics
• Works great for manufacturing optimization
– Best product for lowest cost of materials
– Works great for color matching
• Multiterm objective functions
– Best design for the lowest (cost, weight, {your metric here}, …)
– A teraflop/s per device can run many optimizations to map the decision space
• Machine learning with memory or variable inputs
– Recurrent neural networks, IIR filters, …
– Have to iterate the network during training

You can change the world!

Data handling can take as much time as the computational problem!
• Longhorn GPU capabilities
– 2,048 GB of GPU memory in 512 Quadro FX 5800 GPUs
• ORNL Titan
– 112,128 GB of GPU memory in 18,688 K20x GPUs
– Expect 600+ GF/s per device; 600+ GF/s × {*big number* here} = average sustained performance
• Need:
1. Fast and scalable data load
2. Fast and scalable, heterogeneous, flexible, and robust data preprocessing workflows
• What a mouthful!

Big data social media
• Need a simplifying framework
– A laptop can represent a billion-node graph
– People don't understand billion-node graphs!
• Million-node graphs are not comprehensible
• Thousand-node graphs are too complex
• Hundred-node graphs are still too big
• A few to tens of nodes are potentially understandable

• Validate against 3rd-party experts and machine metrics
• Understand: this is a lens looking into a social reality
• Cannot forget that the computer only represents reality!

Sorry, part of my next talk: S3443 - Clicking GPUs into a Portable, Persistent and Scalable Massive Data Framework
Time: 15:00 - 15:50, Location: Room 230B

Important design concept #2
• Try to maintain just one source tree
– OpenACC/OpenMP pragmas are interesting

(Disclaimer/shameless commerce: I'm writing an OpenACC book.)

OpenACC portability

One source tree, three languages:

/* matrix-acc.c (C) */
int main()
{
   ...
   // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
   for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
         for (k = 0; k < SIZE; ++k) {
            c[i][j] += a[i][k] * b[k][j];
         }
      }
   }
   return 0;
}

! matrix-acc.f (Fortran)
program example1
  ...
!$acc data copyin(a,b) copy(c)
!$acc kernels loop
  ! Compute matrix multiplication.
  do i=1, n_size
    do j=1, n_size
      do k = 1, n_size
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      enddo
    enddo
  enddo
!$acc end data
end program example1

// C++
int main()
{
   cout << "Hello World" << endl;
   // load data and initialize parameters
   init();
#pragma acc data \
   copyin(param[0:N_PARAM-1]) \
   pcopyin(example[0:nExamples*EXAMPLE_SIZE-1])
   {
      optimize( objFunc ); // the optimizer calls the objective function
   }
   return 0;
}

Coprocessor and GPU demos shown at SC12 by PGI and CAPS (the CAPS demo via OpenCL translation)

Lessons
• Use the highest-level interface first
– Delve down into lower-level programming when:
• You need higher performance
• The high-level API does not do what you want
• Use a single source tree

OpenACC source tree

[Diagram: one source tree — legacy and new C/C++ and Fortran files — translated to CUDA, OpenCL, and others]

Will OpenCL match CUDA-5 features like dynamic parallelism?
– Part of the OpenACC version 2 specification
– Necessary for divide-and-conquer problems

CUDA + primitive restart (a potent combination!)
Primitive restart:
– A feature of OpenGL 3.1
– Roughly 60x faster than optimized OpenGL
– Avoids the PCIe bottleneck
– Variable-length data works great!
LiDAR: 131M points at 15-33 FPS (C2070)

In collaboration with Global Navigation Sciences (http://globalnavigationsciences.com/)

"Primitive" means a primitive OpenGL op:

glPrimitiveRestartIndex(TAG);
glEnableClientState(GL_PRIMITIVE_RESTART_NV);
glDrawElements(GL_LINE_STRIP, qIndexSize, GL_UNSIGNED_INT, qIndices);

qIndices: A, B, TAG, C, D, E
pos[]: A(x,y,z), B(x,y,z), C(x,y,z), D(x,y,z), E(x,y,z)
The TAG splits the index stream into Primitive 1 (A-B) and Primitive 2 (C-D-E). A sketch of filling qIndices follows.
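A hedged sketch of how qIndices might be populated, assuming TAG is a sentinel index that no real vertex ever uses (e.g. 0xFFFFFFFF):

GLuint qIndices[6];
GLsizei qIndexSize = 6;
qIndices[0] = 0;   // A
qIndices[1] = 1;   // B
qIndices[2] = TAG; // restart: ends line strip 1, begins line strip 2
qIndices[3] = 2;   // C
qIndices[4] = 3;   // D
qIndices[5] = 4;   // E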

[Diagram: the two rendered line strips, A-B and C-D-E]

Conventional OpenGL workflow
1. Host generates data
2. Host issues draw operation(s)
3. Image appears

Primitive restart OpenGL workflow:

1. Map OpenGL buffer
2. Run kernel
3. Sync with host
4. Host issues a primitive draw operation
5. Image appears
6. Unmap OpenGL buffer

Benefits
• Rule 1: Avoid the PCIe bus!
• Exploit the massively parallel performance of the device

Primitive restart generates better-quality images
• Rendering performance can be optimized by arranging the indices to achieve the highest reuse in the texture units
• Higher-quality images can be created by alternating the direction of tessellation

[Images: old vs. new tessellation direction]

Try to optimize the drawing operations to make the best use of the texture cache
• Performance depends on spatial locality
• Z-curve ordering

Interactive 100+M LiDAR data points!
• Worst case: each data point is recalculated
– Useful for onboard triangulation
– Custom metrics
– Etcetera
• With a simple modification of the Chapter 9 "CUDA Application Design and Development" example code

Sorry, part of my next talk: S3443 - Clicking GPUs into a Portable, Persistent and Scalable Massive Data Framework
Time: 15:00 - 15:50, Location: Room 230B

For the demo, think Kinect and 3D morphing for augmented reality (identify flesh-colored blobs for hands)

Artifacts are caused by picking a colorspace rectangle rather than an ellipse.

The entire segmentation method:

__global__ void kernelSkin(float4* pos, uchar4 *colorPos,
                           unsigned int width, unsigned int height,
                           int lowPureG, int highPureG,
                           int lowPureR, int highPureR)
{
   unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
   unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
   int r = colorPos[y*width+x].x;
   int g = colorPos[y*width+x].y;
   int b = colorPos[y*width+x].z;
   // Normalize out intensity to get pure chromaticity
   int pureR = 255*( ((float)r)/(r+g+b));
   int pureG = 255*( ((float)g)/(r+g+b));
   // Black out any pixel outside the skin-tone rectangle
   if( !( (pureG > lowPureG) && (pureG < highPureG)
       && (pureR > lowPureR) && (pureR < highPureR) ) )
      colorPos[y*width+x] = make_uchar4(0,0,0,0);
}

Sedláček, M. (2004). Evaluation of RGB and HSV Models in Human Faces Detection. Central European Seminar on Computer Graphics, Budmerice. CompSysTech'2004, (pp. 125-131).
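A hedged usage sketch of launching the kernel over a width × height frame (the 16×16 block shape is an assumption; real code must handle dimensions that aren't multiples of the block size):

dim3 threads(16, 16);
dim3 blocks(width/threads.x, height/threads.y); // assumes width, height divisible by 16
kernelSkin<<<blocks, threads>>>(pos, colorPos, width, height,
                                lowPureG, highPureG, lowPureR, highPureR);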

Manipulating real-time video (Chapter 12 source code)

Thank you!

Rob Farber
Chief Scientist, BlackDog Endeavors, LLC
Author, "CUDA Application Design and Development"
Research consultant: ICHEC, Fortune 100 companies, and others
• Dr. Dobb's Journal CUDA & OpenACC tutorials
• OpenCL tutorials on The Code Project
• Columnist for Scientific Computing, and other venues