S3012 - Simplifying Portable Killer Apps with OpenACC and CUDA-5, Concisely and Efficiently

Rob Farber
Chief Scientist, BlackDog Endeavors, LLC
Author, "CUDA Application Design and Development"
Research consultant: ICHEC, Fortune 100 companies, and others
Scientist

• Dr. Dobb's Journal CUDA & OpenACC tutorials
• OpenCL tutorials on The Code Project
• Columnist for Scientific Computing, and other venues

The three pillars of science
• Scientists spend most of their time working on computers
• The last five years have revolutionized computing
• Let's briefly look at how this has happened

From games to supercomputers
• GPUs evolved from pushing pixels; CPUs evolved from running applications
• The failure of Dennard scaling caused the switch to multicore
• Farber, "Intel's 50+ core MIC architecture: HPC on a Card or Massive Co-Processor?"

[Image: Larrabee (zZz) evolved into the Xeon Phi (MIC)]

Supercomputing for the masses!
[Image: CUDA (100 million+ GPUs) and OpenCL (1/3 billion+? GPUs)]
• Market forces evolved GPUs into massively parallel GPGPUs (General Purpose GPUs)
• 400+ million CUDA-enabled GPUs says it all!
• CUDA put supercomputing in the hands of the masses
– December 1996: ASCI Red, the first teraflop supercomputer
– Today: kids buy GPUs with flop rates comparable to the systems available to scientists with supercomputer access in the mid-to-late 1990s
• GTX 560: $60 USD on eBay

Remember that Finnish kid who wrote some software to understand operating systems?

Inexpensive commodity hardware enables:
• New thinking
• A large educated base of developers

You can change the world!

GPUs enable killer apps!
• Orders of magnitude faster apps / low-power apps:
– 10x can make computational workflows more interactive (even poorly performing GPU apps are useful)
– 100x is disruptive and has the potential to fundamentally affect scientific research by removing time-to-discovery barriers
– 1000x and greater, achieved through the use of the NVIDIA SFUs (Special Function Units) or multiple GPUs … Whooo Hoooo!

Two big ideas:
1. SIMD
2. A strong scaling execution model

Big hardware idea 1: SIMD
[Images: Kepler K20; the Connection Machine]
High performance from the past:
• Space and power efficient
• Long life via a simple model

Works great on multi-core MPI systems!
[Chart: Observed peak effective rate (TF/s) vs. number of Ranger (Barcelona) cores, 0 to 70,000]
Farber's general SIMD mapping:
• 60,000 cores: 363 TF/s measured
• 62,796 cores: 386 TF/s (projected)
"Most efficient implementation to date" (Singer 1990), (Thearling 1995)

Results presented at SC09 (courtesy TACC)

Scalability is required to use all those cores (the strong scaling execution model)
• Threads can only communicate within a thread block
– (yes, there are atomic ops)
• Fast hardware scheduling
– Both at the Grid level and on the SM/SMX
A block-local reduction sketch follows.
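As a minimal sketch of what "communicate within a thread block" looks like in CUDA (the kernel name is illustrative, and a 256-thread, power-of-two block size is assumed):

__global__ void blockSum(const float *in, float *out, int n)
{
   __shared__ float s[256];                  // one slot per thread in the block
   int tid = threadIdx.x;
   int i = blockIdx.x * blockDim.x + tid;
   s[tid] = (i < n) ? in[i] : 0.f;
   __syncthreads();
   // Threads in this block cooperate through shared memory.
   for (int stride = blockDim.x/2; stride > 0; stride >>= 1) {
      if (tid < stride) s[tid] += s[tid + stride];
      __syncthreads();
   }
   if (tid == 0) atomicAdd(out, s[0]);       // atomics handle the rare cross-block step
}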

You can see the need for scalability:
• Intel Sandy Bridge Core i7 3960X (6 cores)
• AMD 7970 (1,280 work-items)
• NVIDIA Fermi (1,536 CUDA cores)
• NVIDIA Kepler K20 (2,880 CUDA cores)
• Knights Corner (61 cores / 244 threads)

Assert that a strong scaling execution model is required to run on future massively parallel devices.

Big idea 2: A strong scaling execution model!
• Four basic types of programming models:
– Language platforms based on a strong-scaling execution model (CUDA and OpenCL™)
– Directive-based programming like OpenMP and OpenACC
• Note: OpenACC can utilize a strong scaling execution model
– Common libraries providing FFT and BLAS functionality
– MPI (Message Passing Interface)
• Perfect strong scaling decreases runtime linearly with the number of processing elements

MIC differs from GPUs
• 64 cores on the die
• Somewhere between 50 and 64 cores activated, depending on yields and clock speeds
• Expect 50 to 64 cores running at 1.2 GHz to 1.6 GHz (source: The Register)

• Uses a per-core vector unit for high flop rates
• Assumed 8 GB per PCIe card
Image source (one comment removed): http://www.hpcwire.com/hpcwire/2012-04-03/nvidia_pokes_holes_in_intel_s_manycore_story.html

Flexible, but strong scalability is not guaranteed ("Program with lots of threads that use vectors")
• 61 cores, each with a wide per-core vector unit
• Floating-point performance comes from the per-core vector unit

[Illustration: cores, each with a 512-wide vector unit (vs. narrower SSE), connected by a ring interconnect]

• Assume the performance of 61 Pentium cores at 1.2-1.6 GHz when the wide vector unit is not used
• Similarly, assume the performance of a single 1.2-1.6 GHz Pentium core on sequential portions of code (Amdahl's Law)

Four general programming models
1. Language platforms based on a strong-scaling execution model (CUDA and OpenCL™)
2. Directive-based programming like OpenMP and OpenACC
• Note: OpenACC can utilize a strong scaling execution model
3. Common libraries providing FFT and BLAS functionality
4. MPI (Message Passing Interface)

OpenACC C language programming

/* matrix-omp.c */
int main()
{
   ...
   // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(a,b,c) private(i,j,k)
   for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
         for (k = 0; k < SIZE; ++k) {
            c[i][j] += a[i][k] * b[k][j];
         }
      }
   }
   return 0;
}

/* matrix-acc.c */
int main()
{
   ...
   // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
   for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
         for (k = 0; k < SIZE; ++k) {
            c[i][j] += a[i][k] * b[k][j];
         }
      }
   }
   return 0;
}
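As a usage sketch (assuming the PGI compiler; other OpenACC compilers use different flags), the two versions above might be built with:

pgcc -mp matrix-omp.c -o matrix-omp
pgcc -acc -Minfo=accel matrix-acc.c -o matrix-acc

The -Minfo=accel report describes how the compiler scheduled the loop nest on the accelerator.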

Farber, “Pragmatic Parallelism Part 1: Introducing OpenACC”

OpenACC Fortran anyone?

! matrix-acc.f
program example1
  ...
!$acc data copyin(a,b) copy(c)
!$acc kernels loop
  ! Compute matrix multiplication.
  do i=1, n_size
    do j=1, n_size
      do k = 1, n_size
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      enddo
    enddo
  enddo
!$acc end data
end program example1

For comparison, the OpenMP C version (matrix-omp.c) is shown above.

Farber, “Pragmatic Parallelism Part 1: Introducing OpenACC”

OpenACC adds the concept of device memory
Example NVIDIA Visual Profiler (nvvp) timeline from "Introducing OpenACC":
• Move matrices a, b, and c to the coprocessor (GPU)
• Perform the matrix multiply (line 24 in main)
• Move matrix c back to the host

Farber, “Pragmatic Parallelism Part 1: Introducing OpenACC”

Three rules for fast GPU/co-processor codes
1. Get the data on the device (and keep it there!)
• PCIe x16 v2.0 bus: 8 GiB/s in a single direction
• 20-series GPUs: 140-200 GiB/s
2. Give the device enough work to do
• Assume 2 μs latency and a 1 TF/s device
• Can waste (2 × 10⁻⁶ s × 10¹² flop/s) = 2M operations
3. Reuse and locate data to avoid global memory bandwidth bottlenecks
• 10³ Gflop/s hardware can deliver 10 Gflop/s when global memory limited
• Causes a 100x slowdown!
Corollary: avoid malloc/free!
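A back-of-the-envelope illustration of rule 1, using the numbers above (1 GiB of data is an assumed workload):

$$T_{\text{transfer}} = \frac{1\ \text{GiB}}{8\ \text{GiB/s}} = 125\ \text{ms}, \qquad 125\ \text{ms} \times 10^{12}\ \text{flop/s} = 1.25 \times 10^{11}\ \text{flops forgone per transfer}$$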

Research: TLP can help with nested parallelism!
Square matrix multiply: which loops are faster, or is there a difference?

Rearranged loops:

for (int i = 0; i < size; ++i)
   for (int k = 0; k < size; ++k)
      for (int j = 0; j < size; ++j)
         C[i][j] += A[i][k] * B[k][j];

Conventional loops:

for (int i = 0; i < size; ++i)
   for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k)
         tmp += A[i][k] * B[k][j];
      C[i][j] = tmp;
   }

Runtimes (lower is better):

Rearranged loops
Run       OpenACC    OpenMP     OpenACC speedup
1         0.04298    0.12139    2.82
2         0.041681   0.13461    3.23
3         0.041697   0.13055    3.13
Average                         3.06

Conventional loops
Run       OpenACC    OpenMP     OpenACC speedup
1         0.045108   2.9749     65.95
2         0.043823   2.6862     61.30
3         0.043793   2.6802     61.20
Average                         62.82

Dynamic nested parallelism is even worse!
http://www.drdobbs.com/parallel/creating-and-using-libraries-with-openac/240012502

OpenACC "Hello World" to exascale

int main()
{
   cout << "Hello World" << endl;
   // load data and initialize parameters
   init();
#pragma acc data \
   copyin(param[0:N_PARAM-1]) \
   pcopyin(example[0:nExamples*EXAMPLE_SIZE-1])
   {
      optimize( objFunc ); // the optimizer calls the objective function
   }
   return 0;
}

double objFunc( ... )
{
   double err=0.;
#pragma acc parallel loop reduction(+:err)
#pragma omp parallel for reduction(+ : err)
   for(int i=0; i < nExamples; i++) {
      // transform
      float d = myFunc(i, param, example, nExamples, NULL);
      // reduce
      err += d*d;
   }
   return sqrt(err);
}
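For concreteness, a hypothetical myFunc — the per-example transform the listing leaves abstract — might compute the residual of a linear model (this body is an assumption; EXAMPLE_SIZE is taken to be 2, with each example holding an (x, y) pair):

inline float myFunc(int i, const float *param, const float *example,
                    int nExamples, void *unused)
{
   // Hypothetical transform: residual of y ≈ param[0]*x + param[1]
   float x = example[i*EXAMPLE_SIZE];
   float y = example[i*EXAMPLE_SIZE + 1];
   return param[0]*x + param[1] - y; // objFunc squares and sums these residuals
}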

[Diagram: DATA feeding the objective function — Exascale capable!]

Optimize an "objective function"
Applicable to a general class of optimization problems:
– Locally Weighted Linear Regression (LWLR)
– Neural Networks
– Naive Bayes (NB)
– Gaussian Discriminative Analysis (GDA)
– k-means
– Logistic Regression (LR)
– Independent Component Analysis (ICA)
– Expectation Maximization (EM)
– Support Vector Machine (SVM)
– Others (MDS, Ordinal MDS, etcetera)

A general mapping: energy = objFunc(p1, p2, …, pn)

Optimization method (Powell, Conjugate Gradient, other) running on the host:
• Step 1: Broadcast parameters
• Step 2: Calculate partials
• Step 3: Sum partials to get energy

[Diagram: the host broadcasts p1, p2, …, pn to GPUs 1-4; each GPU holds its own slice of the examples (0 to N-1, N to 2N-1, 2N to 3N-1, 3N to 4N-1)]
An MPI sketch of this pattern follows.
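A minimal sketch of the broadcast/partial/reduce pattern with MPI (the helper objFuncLocal, which would evaluate a rank's slice of examples on its GPU, is a hypothetical stand-in):

#include <mpi.h>
#include <math.h>

extern double objFuncLocal(double *param, int nParam); // hypothetical per-GPU partial sum

double distributedObjFunc(double *param, int nParam)
{
   double partial, total = 0.;
   // Step 1: broadcast the current parameters to every rank
   MPI_Bcast(param, nParam, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   // Step 2: each rank computes a partial sum of squared errors on its data slice
   partial = objFuncLocal(param, nParam);
   // Step 3: sum the partials to get the energy
   MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
   return sqrt(total); // meaningful on rank 0
}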

See a path to exascale (MPI can map to hundreds of GPUs)
• Over 350 TF/s of performance on Longhorn (including communications!)
• Dominant runtime of code that scales to 500 GPUs
• 600+ GF/s per K20
• Expect tens of petaflop/s average performance from Titan

Always report "Honest Flops":

$$\mathit{EffectiveRate} = \frac{\mathit{TotalOpCount}}{T_{broadcast} + T_{objfunc} + T_{reduce}}$$

Important design concept #1
• Make your life easy
– Use the highest-level interface first
• Delve down into lower-level programming when:
– You need higher performance
– The high-level API does not do what you want
» It is necessary to use a lower-level capability
» Make use of some hardware feature

"Computational Universality": an XOR neural network
[Diagram: network with activation G(x)]
• The example of XOR nicely emphasizes the importance of hidden neurons:

• They re-represent the input such that the problem becomes linearly separable
• Networks with hidden units can implement any Boolean function -> computationally universal devices!
• Networks without hidden units cannot learn XOR, and cannot represent large classes of problems
A minimal XOR network sketch follows.
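As a minimal sketch (the weights and thresholds are chosen by hand here, not learned), a 2-2-1 network with step activations computes XOR:

#include <stdio.h>

// Step activation: fires when the weighted sum exceeds zero.
static int step(double x) { return x > 0.0 ? 1 : 0; }

int main(void)
{
   for (int a = 0; a <= 1; ++a)
      for (int b = 0; b <= 1; ++b) {
         int h1 = step(a + b - 0.5);    // hidden unit 1: OR
         int h2 = step(a + b - 1.5);    // hidden unit 2: AND
         int y  = step(h1 - h2 - 0.5);  // output: OR AND NOT(AND) == XOR
         printf("%d XOR %d = %d\n", a, b, y);
      }
   return 0;
}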

NetTalk
[Image: NetTalk after 500 learning loops; finished network]
Sejnowski, T. J. and Rosenberg, C. R. (1986). NETtalk: a parallel network that learns to read aloud. Cognitive Science, 14, 179-211.
http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)

Application to bioinformatics
A.S. Lapedes, C. Barnes, C. Burks, R.M. Farber, K. Sirotkin, "Applications of Neural Net and Other Machine Learning Algorithms to DNA Sequence Analysis", Computers and DNA, SFI Studies in the Sciences of Complexity, vol. VII, Eds. G. Bell and T. Marr, Addison-Wesley (1989).

[Diagram: NetTalk maps a window of text ("t e X t") through internal connections to the phoneme to be pronounced; the bioinformatics network maps an A/C/G/T window through internal connections to a T|F exon-region output]

Predicting binding affinity (the closer you look, the greater the complexity)

[Electron microscope image]

The question for computational biology
• How do we know you are not playing expensive computer games with our money?
• Utilize a blind test

Binding affinity for a specific antibody
[Diagram: network with one input per hexamer position (A0-A5) and internal connections predicting affinity]
• Possible hexamers: 20⁶ = 64M
• Training data: 1k-2k pseudo-random (hexamer, binding affinity) pairs

"Learning Affinity Landscapes: Prediction of Novel Peptides", Alan Lapedes and Robert Farber, Los Alamos National Laboratory Technical Report LA-UR-94-4391 (1994).

Hill climbing to find high affinity (approx. 0.001% sampling), then confirm experimentally

Learn: $\mathit{Affinity}_{Antibody} = f(A_0, \ldots, A_5)$

[Diagram: network with inputs A0-A5 and internal connections; hill climbing over predicted affinities, e.g. f(F,F,F,F,F,F), f(F,F,F,F,F,V), f(F,F,F,F,F,L), f(F,F,F,F,L,L), …]

Predict that P,C,T,N,S,L has the highest binding affinity: f(P,C,T,N,S,L). A greedy hill-climbing sketch follows.
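A minimal greedy hill-climbing sketch over hexamer space (predictAffinity, standing in for the trained network, is hypothetical, and higher scores are assumed to mean tighter binding):

#define HEX_LEN 6
extern double predictAffinity(const char peptide[HEX_LEN]); // hypothetical trained model

void hillClimb(char peptide[HEX_LEN])
{
   const char *AA = "ACDEFGHIKLMNPQRSTVWY"; // the 20 amino acids
   double best = predictAffinity(peptide);
   int improved = 1;
   while (improved) {
      improved = 0;
      for (int pos = 0; pos < HEX_LEN; ++pos) {   // try every single-residue mutation
         char keep = peptide[pos];
         for (int a = 0; a < 20; ++a) {
            peptide[pos] = AA[a];
            double score = predictAffinity(peptide);
            if (score > best) { best = score; keep = peptide[pos]; improved = 1; }
         }
         peptide[pos] = keep;                     // retain the best residue at this position
      }
   }
}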

Two important points
• The computer appears to correctly predict experimental data
• Demonstrated that complex binding affinity relationships can be learned from a small set of samples
– Necessary because it is only possible to sample a very small subset of the binding affinity landscape for drug candidates

1995 drug design hardware vs. 2013 (analyzed all available chemical data … TB of data)
• Quad-core, 512 MB Sun workstation
– My Samsung S3 is more powerful and has 2 GB RAM
• 80 GB disk and a TB DLT tape stacker
– A TB laptop hard drive
• 60 Gflop/s Connection Machine
– A mobile GeForce GPU

You can change the world! $30M of hardware replaced by a GPU-accelerated laptop.

Example: PCA (Principal Components Analysis)
• Widely used in data mining and data reduction
– Discuss a method proposed by Sanger (1989); the update rule is stated below
• Extends to Nonlinear PCA (NLPCA)
– Discuss a method by E. Oja, J. Karhunen, L. Wang, and R. Vigario (1995)
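For reference, Sanger's rule (the generalized Hebbian algorithm), with $\eta$ the learning rate, $y_i$ the $i$-th output, and $x_j$ the $j$-th input:

$$\Delta w_{ij} = \eta \, y_i \left( x_j - \sum_{k \le i} y_k \, w_{kj} \right)$$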

[Diagram: autoencoder network with input units (I), bottleneck units (B), and output units (O)]
• The general mapping scales according to data
• Exascale capable!
• Provides the ability to compare linear and nonlinear performance

Intel Xeon Phi runs
• Great from a code portability point of view
• Watch out for jitter! Read: "The Case of the Missing Supercomputer Performance"

[Chart: Gflop/s (0 to 140) vs. number of OpenMP threads (1 to 469), showing runtime variations — jitter; the green dot is the average]

PCA (Principal Components Analysis), 2x10x1x10x2 autoencoder
[Chart: GF/s (0 to 1200) vs. number of data examples in millions (0 to 35); series: PCA Xeon Phi native, PCA Xeon Phi offload, PCA K20c, PCA host (24-core Westmere)]
Looking forward to trying a K20X with a current PCIe chipset

NLPCA (Nonlinear PCA), 2x10x1x10x2 autoencoder
Yes, performance did increase slightly

[Chart: GF/s (0 to 800) vs. number of data examples in millions (0 to 35); series: NLPCA K20c, NLPCA Xeon Phi native, NLPCA Xeon Phi offload, NLPCA host (24-core Westmere)]

Love those SFUs! (Special Function Units)
• Fast transcendental functions
– The world is nonlinear … so are many computational models
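A small CUDA sketch of the idea (the kernel is illustrative; __expf and __sinf are the fast, reduced-precision intrinsics that exercise the SFUs):

__global__ void fastNonlinear(const float *x, float *y, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n)
      y[i] = __expf(-x[i]) * __sinf(x[i]); // fast SFU path
      // vs. expf(-x[i]) * sinf(x[i]);     // slower, full-precision path
}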

TF/s devices open the door to new topics
• Works great for manufacturing optimization
– Best product for lowest cost of materials
– Works great for color matching
• Multiterm objective functions
– Best design for the lowest (cost, weight, {your metric here}, …)
– A teraflop/s per device can run many optimizations to map the decision space
• Machine learning with memory or variable inputs
– Recurrent neural networks, IIR filters, …
– Have to iterate the network during training

You can change the world!

Data handling can take as much time as the computational problem!
• Longhorn GPU capabilities
– 2,048 GB of GPU memory in 512 Quadro FX 5800 GPUs
• ORNL Titan
– 112,128 GB of GPU memory in 18,688 K20x GPUs
– Expect 600+ GF/s per device; 600+ GF/s × {*big number* here} = average sustained performance
• Need:
1. Fast and scalable data load
2. Fast and scalable, heterogeneous, flexible, and robust data preprocessing workflows
• What a mouthful!

Big data social media
• Need a simplifying framework
– A laptop can represent a billion-node graph
– People don't understand billion-node graphs!
• Million-node graphs are not comprehensible
• Thousand-node graphs are too complex
• Hundred-node graphs are still too big
• A few to tens of nodes are potentially understandable

• Validate against 3rd-party experts and machine metrics
• Understand: this is a lens looking into a social reality
• Cannot forget that the computer only represents reality!

Sorry, part of my next talk: S3443 - Clicking GPUs into a Portable, Persistent and Scalable Massive Data Framework
Time: 15:00 - 15:50, Location: Room 230B

Important design concept #2
• Try to maintain just one source tree
– OpenACC/OpenMP pragmas are interesting

(Disclaimer/shameless commerce: I'm writing an OpenACC book.)

OpenACC portability

One source tree, three languages:

/* matrix-acc.c (C) */
int main()
{
   ...
   // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
   for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
         for (k = 0; k < SIZE; ++k) {
            c[i][j] += a[i][k] * b[k][j];
         }
      }
   }
   return 0;
}

! matrix-acc.f (Fortran)
program example1
  ...
!$acc data copyin(a,b) copy(c)
!$acc kernels loop
  ! Compute matrix multiplication.
  do i=1, n_size
    do j=1, n_size
      do k = 1, n_size
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      enddo
    enddo
  enddo
!$acc end data
end program example1

// C++
int main()
{
   cout << "Hello World" << endl;
   // load data and initialize parameters
   init();
#pragma acc data \
   copyin(param[0:N_PARAM-1]) \
   pcopyin(example[0:nExamples*EXAMPLE_SIZE-1])
   {
      optimize( objFunc ); // the optimizer calls the objective function
   }
   return 0;
}

Coprocessor and GPU demos shown at SC12 by PGI and CAPS (the CAPS demo via OpenCL translation)

Lessons
• Use the highest-level interface first
– Delve down into lower-level programming when:
• You need higher performance
• The high-level API does not do what you want
• Use a single source tree

OpenACC source tree

[Diagram: one source tree — legacy and new C/C++ and Fortran files — translated to CUDA, OpenCL, and others]

Will OpenCL match CUDA-5 features like dynamic parallelism?
– Part of the OpenACC version 2 specification
– Necessary for divide-and-conquer problems

CUDA + primitive restart (a potent combination!)
Primitive restart:
– A feature of OpenGL 3.1
– Roughly 60x faster than optimized OpenGL
– Avoids the PCIe bottleneck
– Variable-length data works great!
LiDAR: 131M points at 15-33 FPS (C2070)

In collaboration with Global Navigation Sciences (http://globalnavigationsciences.com/)

"Primitive" means a primitive OpenGL op:

glPrimitiveRestartIndex(TAG);
glEnableClientState(GL_PRIMITIVE_RESTART_NV);
glDrawElements(GL_LINE_STRIP, qIndexSize, GL_UNSIGNED_INT, qIndices);

qIndices: A, B, TAG, C, D, E
pos[]: A(x,y,z), B(x,y,z), C(x,y,z), D(x,y,z), E(x,y,z)
The TAG splits the index stream into Primitive 1 (A-B) and Primitive 2 (C-D-E). A sketch of filling qIndices follows.
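A hedged sketch of how qIndices might be populated, assuming TAG is a sentinel index that no real vertex ever uses (e.g. 0xFFFFFFFF):

GLuint qIndices[6];
GLsizei qIndexSize = 6;
qIndices[0] = 0;   // A
qIndices[1] = 1;   // B
qIndices[2] = TAG; // restart: ends line strip 1, begins line strip 2
qIndices[3] = 2;   // C
qIndices[4] = 3;   // D
qIndices[5] = 4;   // E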

[Diagram: the two rendered line strips, A-B and C-D-E]

Conventional OpenGL workflow
1. Host generates data
2. Host issues draw operation(s)
3. Image appears

Primitive restart OpenGL workflow:

1. Map OpenGL buffer
2. Run kernel
3. Sync with host
4. Host issues a primitive draw operation
5. Image appears
6. Unmap OpenGL buffer

Benefits
• Rule 1: Avoid the PCIe bus!
• Exploit the massively parallel performance of the device

Primitive restart generates better-quality images
• Rendering performance can be optimized by arranging the indices to achieve the highest reuse in the texture units
• Higher-quality images can be created by alternating the direction of tessellation

[Images: old vs. new tessellation direction]

Try to optimize the drawing operations to make the best use of the texture cache
• Performance depends on spatial locality
• Z-curve ordering

Interactive 100+M LiDAR data points!
• Worst case: each data point is recalculated
– Useful for onboard triangulation
– Custom metrics
– Etcetera
• With a simple modification of the Chapter 9 "CUDA Application Design and Development" example code

Sorry, part of my next talk: S3443 - Clicking GPUs into a Portable, Persistent and Scalable Massive Data Framework
Time: 15:00 - 15:50, Location: Room 230B

For the demo, think Kinect and 3D morphing for augmented reality (identify flesh-colored blobs for hands)

Artifacts are caused by picking a colorspace rectangle rather than an ellipse.

The entire segmentation method:

__global__ void kernelSkin(float4* pos, uchar4 *colorPos,
                           unsigned int width, unsigned int height,
                           int lowPureG, int highPureG,
                           int lowPureR, int highPureR)
{
   unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
   unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
   int r = colorPos[y*width+x].x;
   int g = colorPos[y*width+x].y;
   int b = colorPos[y*width+x].z;
   // Normalize out intensity to get pure chromaticity
   int pureR = 255*( ((float)r)/(r+g+b));
   int pureG = 255*( ((float)g)/(r+g+b));
   // Black out any pixel outside the skin-tone rectangle
   if( !( (pureG > lowPureG) && (pureG < highPureG)
       && (pureR > lowPureR) && (pureR < highPureR) ) )
      colorPos[y*width+x] = make_uchar4(0,0,0,0);
}

Sedláček, M. (2004). Evaluation of RGB and HSV Models in Human Faces Detection. Central European Seminar on Computer Graphics, Budmerice. CompSysTech'2004, (pp. 125-131).
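A hedged usage sketch of launching the kernel over a width × height frame (the 16×16 block shape is an assumption; real code must handle dimensions that aren't multiples of the block size):

dim3 threads(16, 16);
dim3 blocks(width/threads.x, height/threads.y); // assumes width, height divisible by 16
kernelSkin<<<blocks, threads>>>(pos, colorPos, width, height,
                                lowPureG, highPureG, lowPureR, highPureR);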

Manipulating real-time video (Chapter 12 source code)

Thank you!

Rob Farber
Chief Scientist, BlackDog Endeavors, LLC
Author, "CUDA Application Design and Development"
Research consultant: ICHEC, Fortune 100 companies, and others
• Dr. Dobb's Journal CUDA & OpenACC tutorials
• OpenCL tutorials on The Code Project
• Columnist for Scientific Computing, and other venues