Experience with Parallel Computers at NASA Ames

David H. Bailey

November 14, 1991

Ref: Intl. J. of High Speed Computing, vol. 5, no. 1 (1993), pg. 51–62.

Abstract

Beginning in 1988, the Numerical Aerodynamic Simulation (NAS) organization at NASA Ames Research Center has studied the usability of highly parallel computers on computational fluid dynamics and other aerophysics applications. Presently this organization operates a CM-2 with 32,768 nodes and an Intel iPSC/860 with 128 nodes. This note gives an overview of the experience in using these systems, highlights both strong points and weak points of each, and discusses what improvements will be required in future highly parallel systems in order that they can gain acceptance in the mainstream of scientific computing.

The author is with the NAS Applied Research Branch at NASA Ames Research Center, Moffett Field, CA 94035.

1 Introduction

NASA Ames Research Center has long been a trail blazer in the field of high performance scientific computation, particularly for computational fluid dynamics (CFD) and other aerophysics applications. Back in the 1970s NASA Ames was the home of the Illiac IV, which featured a peak performance of approximately 250 MFLOPS. In 1982, NASA Ames took delivery of the first two-processor Cray X-MP, which featured a peak performance of nearly 400 MFLOPS. In 1986 the newly formed Numerical Aerodynamic Simulation (NAS) organization at NASA Ames installed the first full-memory Cray-2, with a peak performance of 1.6 GFLOPS. Most recently, the NAS facility added in 1988 a Cray Y-MP with a peak performance of 2.6 GFLOPS.

By the late 1980s it became clear that the "classical" supercomputer design epitomized by existing Cray systems was approaching maturity, and that highly parallel computer systems had the most potential for achieving multi-TFLOPS levels of sustained performance, which we project will be required before the end of the decade. In order to better understand these systems and to help advance this technology to the level required for mainstream supercomputing, the NAS Applied Research Branch was organized in 1988. Scientists in this organization and its affiliates port application programs to parallel computers, develop new algorithms and programming techniques, and study the performance of the resulting implementations [1, 2, 3, 4, 6, 7, 8, 9, 10, 12]. The scope and diversity of these projects can be seen from the partial listing in Table 1.

The NAS project now operates a Connection Machine 2 (CM-2) by Thinking Machines, Inc. and an iPSC/860 by Intel Scientific Computers. There are plans to acquire significantly more powerful systems in the future. As of this date, the CM-2 has been operational for nearly three years and the iPSC/860 for over one year. This note gives a brief summary of our experience and highlights both the strengths and the weaknesses that we have observed in these two systems. Requirements and recommendations for future systems are then outlined.

2 The Connection Machine

The NAS CM-2, obtained in collaboration with the Defense Advanced Research Projects Agency (DARPA), was installed in 1988 with 32,768 bit-serial processors. Recently it was upgraded with 1024 64-bit floating point hardware units and four gigabytes of main memory, or about twice that of the NAS Cray-2.
The system also now has the "slicewise" Fortran compiler, which treats the hardware as a SIMD array of 64-bit floating point processing units.

For the first year or two, most applications on the CM-2 were coded in *LISP, an extension of the LISP language for parallel processing. More recently, however, programmers have started using the CM Fortran language, which is based on Fortran-90 [5]. CM Fortran has been plagued by delays, and version 1.0 was just recently made available.

One notable achievement on the CM-2 is the porting of a variation of the NASA Ames "ARC3D" code. This is a three dimensional implicit Navier-Stokes fluid dynamics application and employs a number of state of the art techniques. On a 64 × 64 × 32 grid, this code now runs at 190 MFLOPS on 16,384 processors [8]. The equivalent one-processor Y-MP rate for this code is 128 MFLOPS.

Project                          Researchers                        Y-MP  CM-2  iPSC
-------------------------------------------------------------------------------------
Multigrid (NAS benchmark)        Frederickson, Barszcz               x     x     x
Conj. gradient (NAS benchmark)   Schreiber, Simon                    x     x     x
3D FFT PDE (NAS benchmark)       Bailey, Frederickson                x     x     x
Integer sort (NAS benchmark)     Dagum                               x     x     x
LU solver (NAS benchmark)        Fatoohi, Venkatakrishnan            x     x     x
Scalar penta. (NAS benchmark)    Barszcz, Weeratunga                 x     x     x
Block tridiag. (NAS benchmark)   Barszcz, Weeratunga                 x     x     x
INS3D (incomp. Navier-Stokes)    Fatoohi, Yoon                       x     x     x
Isotropic turbulence simulation  Wray, Rogallo                       x     x     x
PSIM3 (particle method)          McDonald                            x           x
PSICM (particle method)          Dagum                                     x
F3D (ARC3D multi-zone)           Barszcz, Chawala, Weeratunga        x           x
CM3D (ARC3D derivative)          Levit, Jespersen                    x     x
ARC2D                            Weeratunga                                      x
Unstructured Euler solver        Hammond, Barth, Venkatakrishnan           x     x
Unstructured partitioning        Simon                                     x
High precision vortex analysis   Bailey, Krasny                            x

Table 1: Overview of Parallel CFD Research at NASA Ames

The strongest positive feature of the CM-2, from our perspective, is TMC's adoption of a programming language based on Fortran-90. Some have questioned whether the Fortran-90 array constructs are sufficiently powerful to encompass a large fraction of modern scientific computation. Basically, this is the same as the question of to what extent the data parallel programming model encompasses modern scientific computation. While we cannot speak for other facilities, it seems clear that a very large fraction of our applications can be coded effectively using this model. Even some applications, such as unstructured grid computations, which appeared at first to be inappropriate for data parallel computation, have subsequently been successfully ported to the CM-2, although different data structures and implementation techniques have been required [6].

Another positive feature of the CM-2 is that its system software is relatively stable. It appears to be largely free of the frequent system "crashes" and other difficulties that plague some other highly parallel systems.

One weak point of the CM-2 is that it was not designed from the beginning for floating point computation, and the floating point processors now in the system have been incorporated in a somewhat inelegant fashion. While this has been remedied somewhat with the recent hardware upgrade and the "slicewise" software, the limited bandwidth between floating point processors and the associated main memory modules remains a problem.

Related to this is the limited bandwidth between nodes. Even when using the hypercube network, data cannot be transferred between nodes at a rate commensurate with the floating point computation rate.
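To make the distinction between regular and irregular communication concrete, the following is a minimal Fortran-90 sketch (illustrative only, not drawn from any of the codes above): the first assignment uses the CSHIFT intrinsic, a regular shift that maps onto the machine's grid/hypercube network, while the second gathers through a vector-valued subscript, the kind of irregular reference that on the CM-2 must go through the "router" discussed in the next paragraph.

    program comm_styles
      implicit none
      integer, parameter :: n = 1024
      real    :: u(n), unew(n), gathered(n)
      integer :: perm(n), i

      call random_number(u)
      ! An arbitrary permutation (7 is coprime to n, so each index appears once).
      perm = (/ (1 + mod(7*i, n), i = 0, n-1) /)

      ! Regular, nearest-neighbor communication: each element combines its two
      ! circular neighbors.  CSHIFT is a Fortran-90 intrinsic, and this pattern
      ! can use the fast grid network.
      unew = 0.5 * (cshift(u, 1) + cshift(u, -1))

      ! Irregular communication: a gather through a vector subscript.  Codes
      ! dominated by such references run far slower on a machine of this class.
      gathered = u(perm)

      print *, sum(unew), sum(gathered)
    end program comm_styles

The two assignments look equally innocuous on the page; the difference lies entirely in the communication pattern each implies.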
And when the "router" (an alternate hardware facility for irregular communication) must be used, the performance of the code often drops to workstation levels.

Another weakness of the CM-2 design is that until recently no more than four users could be using the system at a given time, which is too restrictive for daytime operation when scientists are developing and debugging their codes. This restriction stems from the fact that the hardware cannot be partitioned any finer than an 8,192 node section. This difficulty has been remedied by the introduction of time sharing software that allows more than one job to be executing in a given sector of the CM-2. However, the down side of this solution is that individual jobs must occupy more processing nodes than necessary, and thus their performance is reduced accordingly.

The principal weakness with the CM-2 at the present time, however, is the CM Fortran compiler, which still has a number of bugs and frequently delivers disappointing performance. Also, the array intrinsic functions are in several cases rather inefficient. Partly this is due to the fact that writing an effective Fortran-90 compiler, including efficient array intrinsics, has proved to be much more of a challenge than originally anticipated. Other vendors attempting to write Fortran-90 compilers are reported to be having similar difficulties. In any event, clearly the CM Fortran compiler must be further improved.

The weaknesses of the current version of the CM Fortran compiler, coupled with the massively parallel design of the hardware, result in a system that is often very sensitive to the algorithm and implementation technique selected. It is not unusual for a reasonably well conceived and well written code to deliver single digit MFLOPS performance rates when first executed. Often the programmer must try quite a number of different parallel algorithms and implementation schemes before finding one that delivers respectable performance. Experienced CM-2 programmers have learned numerous "tricks" that can improve performance, but until recently TMC provided very little documentation on these "tricks".
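As an illustration of the kind of reformulation involved (a sketch under assumptions, with hypothetical routine names, not code from any project in Table 1), the two subroutines below express the same Jacobi-style relaxation sweep. The first is the element-at-a-time style natural on a serial machine; the second is the Fortran-90 array-section style that the data parallel model, and hence the CM Fortran compiler, reward.

    ! Element-at-a-time formulation: a data parallel compiler must
    ! rediscover the parallelism hidden in this loop nest.
    subroutine relax_loops(u, unew, n)
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: u(n,n)
      real, intent(out)   :: unew(n,n)
      integer :: i, j
      unew = u
      do j = 2, n-1
        do i = 2, n-1
          unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
        end do
      end do
    end subroutine relax_loops

    ! Array-section formulation: one statement updates the whole interior
    ! of the grid, exposing the parallelism directly to the compiler.
    subroutine relax_arrays(u, unew, n)
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: u(n,n)
      real, intent(out)   :: unew(n,n)
      unew = u
      unew(2:n-1, 2:n-1) = 0.25 * (u(1:n-2, 2:n-1) + u(3:n, 2:n-1)  &
                                 + u(2:n-1, 1:n-2) + u(2:n-1, 3:n))
    end subroutine relax_arrays

The two versions compute identical results; on a data parallel system, however, only the second maps directly onto the processor array, and much of the "trick" lore amounts to finding reformulations of this kind.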