Parallelism, Compute Intensity, and Data Vectorization: The CRAY APP

Bradley R. Carlile
Cray Research Superservers, Inc.
3601 SW Murray Blvd., Beaverton, Oregon 97005
[email protected]
(503) 641-3151 (phone); (503) 641-4497 (fax)

Draft submission to SuperComputing '93, Portland, November 1993.

Abstract

High performance on parallel algorithms requires high delivered memory bandwidth, fast computations, and minimal parallel overheads. These three requirements have far-reaching ramifications for complete system design and performance. To satisfy the high computation rates of parallel programs, memory inefficiencies can be avoided by using knowledge of an application's data access patterns and of the interaction between computations and data movement. Compute intensity (the ratio of compute operations to memory accesses required) is central to understanding parallel performance. Several other characteristics of parallel programs, and techniques to exploit them, are also discussed. One of these techniques is data vectorization, which focuses vectorization techniques on the data movement in a code section. These techniques have been realized in the hardware and software design of the CRAY APP shared-memory system.

1.0 Introduction

High performance on parallel programs depends on the following requirements:

1) High memory bandwidth
2) Fast computations
3) Minimal parallel overheads

These three requirements have far-reaching ramifications for performance, ease of programming, programming model, optimization techniques, and suitable types of applications. An understanding of these requirements and a hardware/software codesign process has led to the development of the CRAY APP shared-memory system. Shared-memory systems do not have the split address spaces of distributed-memory machines, which require careful data distribution for performance. In addition, automatic parallelizing compilers for shared-memory machines are a maturing technology.

The CRAY APP is a general-purpose, 84-processor, multiple instruction multiple data (MIMD) shared-memory system [2] [19]. Its use of commodity processors makes it a very cost-effective machine. It is a multi-user compute server programmed using autoparallelizing FORTRAN or C in a Unix environment [1]. The peak performance is 6.7 Gflops for 32-bit computations and 3.4 Gflops for 64-bit computations. The CRAY APP was designed as a production machine with an emphasis on ease of use.

The CRAY APP uses commercial processors that can issue multiple pipelined instructions to deliver fast computations in parallel programs. Loops are optimized on multiple-instruction-issue processors using software pipelining techniques [12] [23]. Software pipelining allows multiple-instruction-issue processors to be viewed as efficient programmable vector processors.

The key to understanding high-performance system design is understanding the characteristics of the important user applications. Memory usage is one of the most critical and often overlooked characteristics of programs, and it is becoming more critical as the gap between processor speed and memory speed grows [9]. The memory bandwidth of a system is also a major contributor to the system's price point. At any particular memory bandwidth, efficient use of that bandwidth can provide higher performance than a higher memory bandwidth used inefficiently. This paper focuses on several aspects of memory usage and on some parallel issues.

2.0 Memory Bandwidth

Memory bandwidth is directly related to performance. The relationship between compute operations and data required is called compute intensity [10] [11].
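As an added illustration (a sketch, not from the original paper), the C fragment below contrasts the operation and data-word counts of two common loops; the data-reusing loop nest has far higher compute intensity. The formal definition follows in equation (1).

```c
/* Added sketch (not from the paper): operation vs. data-word counts for
   two loops.  All figures are algorithmic counts, assuming each distinct
   array element moves to or from memory exactly once. */
#include <stdio.h>

enum { N = 512 };

int main(void) {
    /* DAXPY, y[i] = a*x[i] + y[i]:
         ops   = 2N (one multiply + one add per iteration)
         words = 3N (load x[i], load y[i], store y[i])
         compute intensity = 2N/3N ~= 0.67 ops/word, constant in N. */
    double daxpy = (2.0 * N) / (3.0 * N);

    /* Matrix multiply, C = A*B (triply nested loop):
         ops   = 2N^3
         words = 3N^2 (A, B, and C each counted once as distinct data)
         compute intensity = 2N/3 ops/word, growing linearly with N,
         because every element of A and B is reused N times. */
    double matmul = (2.0 * N) / 3.0;

    printf("DAXPY intensity : %6.2f ops/word\n", daxpy);
    printf("Matmul intensity: %6.2f ops/word at N = %d\n", matmul, N);
    return 0;
}
```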
Others have subsequently also defined the reciprocal of compute intensity as R [6]. Compute intensity is defined as follows:

    Compute Intensity = Number of Operations / Number of Data Words Accessed    (1)

For numerical computations, the operation count is usually expressed in terms of floating-point operations. Of course, it is equally valid to use an integer operation count for integer-dominated computations. Most applications have high compute intensity. High compute intensity is often found in nested loops that reuse data. It is also found in calculations that perform complicated operations on data.

The compute intensity of an algorithm can be used to determine the performance bound of an application on a given memory system. The estimate is based on delivered memory bandwidth:

    Performance = Compute Intensity × Memory Bandwidth    (2)

or

    Operations/Second = (Operations/Word) × (Words/Second)    (3)

This formula gives the maximum performance that the memory system can sustain for a given application. Even though it is completely independent of the floating-point processing capabilities of a given machine, it can often be a better measure. A different compiler focus or a different algorithm implementation can often greatly increase the realized compute intensity of an application. Increases in compute intensity are reflected in higher execution performance at any memory bandwidth.

Another way to estimate performance is to base it on the number of memory accesses required. Sometimes it is easier to estimate the required data accesses than the compute intensity. This estimate is most accurate when the memory bandwidth of an application is saturated:

    Time = Memory Accesses / Memory Bandwidth    (4)

Either equation (2) or (4) can be used to determine the percentage of memory bandwidth achieved on a given application. The percentage of memory bandwidth delivered is a particularly helpful metric when optimizing the performance of an application. The CRAY APP often delivers 60-90% of total memory bandwidth during the execution of parallel programs.

Relative to the problem size, most algorithms have either constant, logarithmically growing, or linearly growing compute intensity. Current cache-based processors have enough on-chip storage to often realize moderate compute intensities of 4 to 30 for a wide variety of applications. Our experience is that half of all applications have loops with constant compute intensity at moderate values. Table 1 contains an example of each of these classes of compute intensity.

    Algorithm        Operation Count   Data Words Used   Compute Intensity (Ops/Word)
    Sine             23N               2N                11.5
    Complex 1D FFT   5N log2(N)        4N                (5/4) log2(N)
    Real Solver      (2/3)N^3          N^2               (2/3)N

    Table 1. Compute Intensities of Basic Algorithms

The compute intensity in an application will often be different for each basic code block (loop, nested loops, conditional, etc.) within the application. The compute intensity of each basic block depends on the system architecture and the compiler's optimization strategy.
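To see equations (2) and (4) and the Table 1 entries in action, here is an added C sketch; the 1 Gigaword/s memory bandwidth is an assumed figure for illustration, not a CRAY APP specification. For a length-2048 FFT it reproduces the 13.75 ops/word intensity used in the C90 example below.

```c
/* Added sketch (not from the paper): apply equations (2) and (4) to the
   Table 1 algorithms.  The memory bandwidth is an assumed example value. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double bw = 1.0e9;  /* assumed memory bandwidth: 1 Gigaword/s */
    const double n  = 2048.0; /* problem size (a length-2K FFT)         */

    /* Compute intensities from Table 1: operation count / data words. */
    double sine_i   = (23.0 * n) / (2.0 * n);              /* 11.5          */
    double fft_i    = (5.0 * n * log2(n)) / (4.0 * n);     /* (5/4) log2(N) */
    double solver_i = ((2.0 / 3.0) * n * n * n) / (n * n); /* (2/3) N       */

    /* Equation (2): Performance = Compute Intensity x Memory Bandwidth. */
    printf("Sine  : I = %7.2f ops/word -> %8.2f Gflops bound\n",
           sine_i, sine_i * bw / 1e9);
    printf("FFT   : I = %7.2f ops/word -> %8.2f Gflops bound\n",
           fft_i, fft_i * bw / 1e9);
    printf("Solver: I = %7.2f ops/word -> %8.2f Gflops bound\n",
           solver_i, solver_i * bw / 1e9);

    /* Equation (4): Time = Memory Accesses / Memory Bandwidth. */
    printf("FFT data-movement time: %.2e s for %.0f words\n",
           (4.0 * n) / bw, 4.0 * n);
    return 0;
}
```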
Most applications have a great deal of compute intensity. Even for small data sizes, many important algorithms exceed the design point of current small- or large-scale architectures. Most architectures have a much higher performance potential based on memory bandwidth. For example, one-dimensional FFTs of length 2K (N = 2048) have a compute intensity of 13.75 (see Table 1). Using this compute intensity and equation (2), one could support 220 Gflops on the memory bandwidth of a CRAY Y-MP/C90 (16 Gigawords/s to the vector units). However, for this algorithm the performance is limited to less than the peak computational rate of 16 Gflops. A compiler could therefore produce code with a compute intensity of only 1.0 and still achieve maximum performance, since 1.0 ops/word at 16 Gigawords/s already sustains the 16 Gflops peak.

If a program consists of a linear sequence of basic blocks with different compute intensities, then the realized compute intensity, IR, for the entire sequence is the weighted average of the compute intensity of each block, Ib, multiplied by the percentage of work in each basic block, Pb:

    IR = Σ (b = 1 to n) Ib × Pb    (5)

The compute intensity and the percentage of work in each basic block are often dependent on the problem size of an application. Frequently, the compute intensity will grow with an increase in problem size.

It is helpful to define another ratio, called leverage, to quantify the data movement in a particular implementation. Leverage is defined as follows:

    Leverage = Compute Time / Data Movement Time    (6)

Leverage is directly related to compute intensity on a given machine. Compute time is related to the operation count, and data movement time is related to the number of data points involved in the computation. An algorithm with a high compute intensity will often have a high leverage. However, it is possible for an algorithm to have a low compute intensity and a high leverage. This results either when a calculation takes a long time to perform the floating-point operations or when many non-floating-point operations are performed.

Leverage can be used to explain how several processors can work in parallel to saturate the available memory bandwidth. For example, if a particular loop has a leverage of 11, it will spend only 9% of its execution time moving data. If the computation is in a parallel region of code, eleven processors could be computing while one processor moves data. In this way, twelve processors can saturate the memory bandwidth and maximize the performance achieved on the memory system.

Many common data access patterns cause poor memory bandwidth utilization and thereby degrade compute performance when implemented on cache-based systems. The problems can be grouped into the three basic categories of cache miss handling (MISS), bandwidth shortcomings (BW), and latency issues (LAT). These are shown with their associated causes in the table below.

    Cache Problem (type)     Line Size   Miss Penalty   Write Policy   Set-Associativity
    Non-Stride-1 slow (BW)   yes         yes            no             no
    Over-fetch (BW)          yes         no             yes            yes
    Write BW Waste (BW)      yes         no             no             yes
    Interference (MISS)      yes         no             yes            yes
    Miss Stalls (MISS)       yes         yes            no             no
    Latency variance (LAT)   no          yes            no             yes

    Table 2. Cache Problems and Causes

Losing memory bandwidth is a chief concern in these systems, since the delivered "cache-friendly" stride-1 data-fetching memory bandwidth of current commercial micro-
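As a final added sketch (not from the paper), the C fragment below reproduces the leverage arithmetic of equation (6): with leverage L, data movement takes 1/L of the compute time, and L processors can compute while one moves data, so L + 1 processors saturate the memory system.

```c
/* Added sketch (not from the paper): the leverage model of equation (6).
   Leverage L = compute time / data movement time.  Following the text's
   model, data movement takes 1/L of the compute time (about 9% for
   L = 11), so L processors can compute while one moves data: L + 1
   processors saturate the memory bandwidth. */
#include <stdio.h>

int main(void) {
    double leverage        = 11.0;           /* example value from the text */
    double moving_fraction = 1.0 / leverage; /* ~9% of execution time       */
    int    saturating_procs = (int)leverage + 1; /* 11 computing + 1 moving */

    printf("Leverage %.0f: %.0f%% of time moving data; "
           "%d processors saturate memory bandwidth\n",
           leverage, 100.0 * moving_fraction, saturating_procs);
    return 0;
}
```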