CUDA and Opencl

Total Page:16

File Type:pdf, Size:1020Kb

CUDA and Opencl CUDACUDA andand OpenCLOpenCL ------ DevelopmentDevelopment InterfacesInterfaces forfor MulticoreMulticore ProgrammingProgramming Dr. Jun Ni, Ph.D. Associate Professor of Radiology, Biomedical Engineering, Mechanical Engineering, Computer Science The University of Iowa, Iowa City, Iowa, USA Dec. 23 Harbin Engineering University OutlineOutline CUDA,CUDA, NavidaNavida--basedbased ProgrammingProgramming environmentenvironment forfor GPGPUGPGPU OpenCLOpenCL,, openopen sourcesource programmingprogramming environmentenvironment forfor multicoremulticore processorprocessor systemssystems forfor Cell/BECell/BE andand GPUGPU IntroductionIntroduction toto CUDACUDA CUDACUDA (an(an acronymacronym forfor ComputeCompute UnifiedUnified DeviceDevice ArchitectureArchitecture)) a parallel computing architecture developed by NVIDIA CUDACUDA isis thethe computingcomputing engineengine inin NVIDIANVIDIA graphicsgraphics processingprocessing unitsunits ((GPUsGPUs)) accessible to software developers through industry standard programming languages ProgrammersProgrammers useuse 'C'C forfor CUDA'CUDA' (C(C withwith NVIDIANVIDIA extensions)extensions) compiled through a PathScale Open64 C compiler to code algorithms for execution on the GPU IntroductionIntroduction toto CUDACUDA CUDA architecture supports a range of computational interfaces OpenCL DirectCompute Third party wrappers Python Fortran Java Matlab The latest drivers all contain the necessary CUDA components CUDA works with all NVIDIA GPUs G8X series onwards: GeForce Quadro Tesla IntroductionIntroduction toto CUDACUDA CUDACUDA givesgives developersdevelopers accessaccess toto thethe nativenative instructioninstruction setset andand memorymemory ofof thethe parallelparallel computationalcomputational elementselements inin CUDACUDA GPUsGPUs.. IntroductionIntroduction toto CUDACUDA UsingUsing CUDA,CUDA, thethe latestlatest NVIDIANVIDIA GPUsGPUs effectivelyeffectively becomebecome openopen architecturesarchitectures likelike CPUsCPUs.. UnlikeUnlike CPUs,CPUs, GPUsGPUs havehave aa parallelparallel "many"many--core"core" architecture,architecture, eacheach corecore capablecapable ofof runningrunning thousandsthousands ofof threadsthreads simultaneouslysimultaneously -- ifif anan applicationapplication isis suitedsuited toto thisthis kindkind ofof anan architecture,architecture, thethe GPUGPU cancan offeroffer largelarge performanceperformance benefitsbenefits MultiMulti--threadsthreads computingcomputing becomesbecomes moremore efficientefficient withinwithin multicoremulticore architecturearchitecture IntroductionIntroduction toto CUDACUDA InIn thethe computercomputer gaminggaming industry,industry, inin additionaddition toto graphicsgraphics rendering,rendering, graphicsgraphics cardscards areare usedused inin gamegame physicsphysics calculationscalculations (physical(physical effectseffects likelike debris,debris, smoke,smoke, fire,fire, fluids)fluids) PhysX Bullet CUDACUDA hashas alsoalso beenbeen usedused toto accelerateaccelerate nonnon--graphicalgraphical applicationsapplications byby anan orderorder ofof magnitudemagnitude oror moremore computational biology Cryptography Other fields IntroductionIntroduction toto CUDACUDA Example:Example: BOINC distributed computing client CUDACUDA providesprovides bothboth aa lowlow levellevel APIAPI andand aa higherhigher levellevel APIAPI TheThe initialinitial CUDACUDA SDKSDK waswas mademade publicpublic onon 1515 FebruaryFebruary 2007,2007, forfor MicrosoftMicrosoft WindowsWindows andand LinuxLinux MacMac OSOS XX supportsupport waswas laterlater addedadded inin versionversion 2.0.2.0. whichwhich supersedessupersedes thethe betabeta releasedreleased FebruaryFebruary 14,14, 20082008 AdvantagesAdvantages CUDACUDA hashas severalseveral advantagesadvantages overover traditionaltraditional generalgeneral purposepurpose computationcomputation onon GPUsGPUs (GPGPU)(GPGPU) usingusing graphicsgraphics APIs.APIs. Scattered reads – code can read from arbitrary addresses in memory. Shared memory – CUDA exposes a fast shared memory region (16KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higherhigher bandwidth than is possible using texture lookups Faster downloads and readbacks to and from the GPU Full support for integer and bitwise operations, including integer texture lookups. LimitationsLimitations andand ChallengesChallenges ItIt usesuses aa recursionrecursion--free,free, functionfunction--pointerpointer--freefree subsetsubset ofof thethe CC language,language, plusplus somesome simplesimple extensions.extensions. However,However, aa singlesingle processprocess mustmust runrun spreadspread acrossacross multiplemultiple disjointdisjoint memorymemory spaces,spaces, unlikeunlike otherother CC languagelanguage runtimeruntime environments.environments. TextureTexture renderingrendering isis notnot supported.supported. ForFor doubledouble precisionprecision (only(only supportedsupported inin newernewer GPUsGPUs likelike GTXGTX 260260[12][12])) therethere areare somesome deviationsdeviations fromfrom thethe IEEEIEEE 754754 standard:standard: roundround--toto--nearestnearest--eveneven isis thethe onlyonly supportedsupported roundingrounding modemode forfor reciprocal,reciprocal, division,division, andand squaresquare root.root. LimitationsLimitations andand ChallengesChallenges InIn singlesingle precisionprecision,, DenormalsDenormals andand signallingsignalling NaNsNaNs areare notnot supportedsupported OnlyOnly twotwo IEEEIEEE roundingrounding modesmodes areare supportedsupported (chop(chop andand roundround--toto--nearestnearest even)even) ThoseThose areare specifiedspecified onon aa perper--instructioninstruction basisbasis ratherrather thanthan inin aa controlcontrol wordword PrecisionPrecision ofof division/squaredivision/square rootroot isis slightlyslightly lowerlower thanthan singlesingle precision.precision. LimitationsLimitations andand ChallengesChallenges TheThe busbus bandwidthbandwidth andand latencylatency betweenbetween thethe CPUCPU andand thethe GPUGPU maymay bebe aa bottleneck.bottleneck. ThreadsThreads shouldshould bebe runningrunning inin groupsgroups ofof atat leastleast 3232 forfor bestbest performance,performance, withwith totaltotal numbernumber ofof threadsthreads numberingnumbering inin thethe thousands.thousands. LimitationsLimitations andand ChallengesChallenges BranchesBranches inin thethe programprogram codecode dodo notnot impactimpact performanceperformance significantly,significantly, providedprovided thatthat eacheach ofof 3232 threadsthreads takestakes thethe samesame executionexecution path;path; thethe SIMDSIMD executionexecution modelmodel becomesbecomes aa significantsignificant limitationlimitation forfor anyany inherentlyinherently divergentdivergent tasktask (e.g.(e.g. traversingtraversing aa spacespace partitioningpartitioning datadata structurestructure duringduring raytracingraytracing).). LimitationsLimitations andand ChallengesChallenges UnlikeUnlike OpenCLOpenCL,, CUDACUDA--enabledenabled GPUsGPUs areare onlyonly availableavailable fromfrom NVIDIANVIDIA ((GeForceGeForce 88 seriesseries andand above,above, QuadroQuadro andand TeslaTesla).). AMD/ATIAMD/ATI onlyonly supportssupports GPGPUGPGPU onon itsits highhigh-- endend chipschips ExampleExample C++C++ loadsloads aa texturetexture fromfrom anan imageimage intointo anan arrayarray onon thethe GPU:GPU: cudaArray* cu_array; texture<float, 2> tex; // Allocate array cudaChannelFormatDesc description = cudaCreateChannelDesc<float>(); cudaMallocArray(&cu_array, &description, width, height); // Copy image data to array cudaMemcpy(cu_array, image, width*height*sizeof(float), cudaMemcpyHostToDevice); // Bind the array to the texture cudaBindTextureToArray(tex, cu_array); // Run kernel dim3 blockDim(16, 16, 1); dim3 gridDim(width / blockDim.x, height / blockDim.y, 1); kernel<<< gridDim, blockDim, 0 >>>(d_odata, width, height); cudaUnbindTexture(tex); __global__ void kernel(float* odata, int height, int width) { unsigned int x = blockIdx.x*blockDim.x + threadIdx.x; unsigned int y = blockIdx.y *blockDim.y + threadIdx.y ; float c = tex2D(tex, x, y); odata[y*width+x ] = c; } ExampleExample inin Python:Python: thethe productproduct ofof twotwo arraysarrays onon thethe GPUGPU import pycuda.driver as drv import numpy import pycuda.autoinit mod = drv.SourceModule(""" __global__ void multiply_them(float *dest, float *a, float *b) { const int i = threadIdx.x; dest[i ] = a[i] * b[i]; } """) multiply_them = mod.get_function("multiply_them") a = numpy.random .randn(400).astype(numpy.float32) b = numpy.random.randn(400).astype(numpy.float32) dest = numpy.zeros_like(a) multiply_them ( drv.Out(dest), drv.In(a), drv.In(b), block=(400,1,1)) print dest-a*b ExampleExample inin Python:Python: matrixmatrix multiplicationmultiplication inin GPUGPU importimport numpynumpy fromfrom pycublaspycublas importimport CUBLASMatrixCUBLASMatrix AA == CUBLASMatrixCUBLASMatrix (( numpy.mat([[1,2,3],[4,5,6]],numpy.float32)numpy.mat([[1,2,3],[4,5,6]],numpy.float32) )) BB == CUBLASMatrixCUBLASMatrix(( numpy.mat([[2,3],[4,5],[6,7]],numpy.float32)numpy.mat([[2,3],[4,5],[6,7]],numpy.float32) )) CC == A*BA*B printprint C.np_matC.np_mat()() LanguageLanguage BindingsBindings UnlikeUnlike OpenCLOpenCL,, CUDACUDA--enabledenabled GPUsGPUs areare onlyonly availableavailable fromfrom NVIDIANVIDIA ((GeForceGeForce 88 seriesseries andand above,above, QuadroQuadro andand TeslaTesla).). AMD/ATIAMD/ATI onlyonly supportssupports
Recommended publications
  • What Is Stream Processing?
    Introduction to Stream Processing Trond Hagen SINTEF ICT, Applied Mathematics Geilo, January 2008 ICT 1 Schedule, Thursday 09:00 - 09:40 Introduction to stream processing Trond Hagen 09:50 - 10:30 Introduction to stream processing cont’d Trond Hagen 15:00-15-40 CUDA programming Johan Seland 15:50-16-30 CUDA programming cont’d Johan Seland 17:00-18-30 Examples of applications André Brodtkorb, Johan Seland, ICT 2 Schedule, Friday 09:00 - 09:45 Introduction to Cell BE Trond Hagen 10:00 - 10:45 Programming Cell BE André Brodtkorb 11:00-12:00 “Birds of a feather” – parallel processing Johan Seland Summary and discussion ICT 3 Thursday evening Don’t miss the quiz in the bar after the dinner! Chance to win a ~10 000,- NOK HPC graphics card sponsored by NVIDIA ICT 4 Outline Introduction to Stream Processing Introduction to Graphics Processing Units (GPUs) GPU Architecture GPU Programming Models Examples Looking Forward ICT 5 What is Stream Processing? A stream is a set of input and output data Stream processing is a series of operations (kernel functions) applied for each element in a stream Uniform streaming is most typical. One kernel at a time is applied to all elements of the stream Single Instruction Multiple Data (SIMD) ICT 6 Instruction-Based Processing Instructions Processor Data Cache Memory Memory operands (data) During processing, the data required for an instruction’s execution is loaded into the cache, if not already present. Very flexible model, but has the disadvantage that the data-sequence is completely driven by the instruction sequence, yielding inefficient performance for uniform operations on large data blocks.
    [Show full text]
  • Rapidmind: Portability Across Architectures and Its Limitations
    RapidMind: Portability across Architectures and its Limitations Iris Christadler and Volker Weinberg Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, D-85748 Garching bei M¨unchen, Germany Abstract. Recently, hybrid architectures using accelerators like GP- GPUs or the Cell processor have gained much interest in the HPC community. The \RapidMind Multi-Core Development Platform" is a programming environment that allows generating code which is able to seamlessly run on hardware accelerators like GPUs or the Cell processor and multi-core CPUs both from AMD and Intel. This paper describes the ports of three mathematical kernels to RapidMind which have been chosen as synthetic benchmarks and representatives of scientific codes. Performance of these kernels has been measured on various RapidMind backends (cuda, cell and x86) and compared to other hardware-specific implementations (using CUDA, Cell SDK and Intel MKL). The results give an insight into the degree of portability of RapidMind code and code performance across different architectures. 1 Introduction The vast computing horsepower which is offered by hardware accelerators and their usually good power efficiency has aroused interest of the high performance computing community in these devices. The first hybrid system which entered the Top500 list [1] was the TSUBAME cluster at Tokyo Institute of Technology in Japan. Several hundred Clearspeed cards were used to accelerate an Opteron based cluster; the system was ranked No. 9 in the Top500 list in November 2006. Already in June 2006, a sustained Petaflop/s application performance was firstly reached with the RIKEN MD-GRAPE 3 system in Japan, a special purpose system dedicated for molecular dynamics simulations.
    [Show full text]
  • Data-Parallel Programming with Intel Array Building Blocks (Arbb)
    Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Data-parallel programming with Intel Array Building Blocks (ArBB) Volker Weinberg * Leibniz Rechenzentrum der Bayerischen Akademie der Wissenschaften, Boltzmannstr. 1, D-85748 Garching b. München, Germany Abstract Intel Array Building Blocks is a high-level data-parallel programming environment designed to produce scalable and portable results on existing and upcoming multi- and many-core platforms. We have chosen several mathematical kernels - a dense matrix-matrix multiplication, a sparse matrix-vector multiplication, a 1-D complex FFT and a conjugate gradients solver - as synthetic benchmarks and representatives of scientific codes and ported them to ArBB. This whitepaper describes the ArBB ports and presents performance and scaling measurements on the Westmere-EX based system SuperMIG at LRZ in comparison with OpenMP and MKL. 1. Introduction Intel ArBB [1] is a combination of RapidMind and Intel Ct (“C for Throughput Computing”), a former research project started in 2007 by Intel to ease the programming of its future multi-core processors. RapidMind was a multi-core development platform which allowed the user to write portable code that was able to run on multi-core CPUs both from Intel and AMD as well as on hardware accelerators like GPGPUs from NVIDIA and AMD or the CELL processor. The platform was developed by RapidMind Inc., a company that started in 2004 based on the research related to the Sh project [2] at the University of Waterloo. Intel acquired RapidMind Inc. in August 2009 and combined the advantages of RapidMind with Intel’s Ct technology into a successor named “Intel Array Building Blocks”.
    [Show full text]
  • A Rough Guide to Scientific Computing on The
    SCOP3 A Rough Guide to Scientific Computing On the PlayStation 3 Technical Report UT-CS-07-595 Version 1.0 by Alfredo Buttari Piotr Luszczek Jakub Kurzak Jack Dongarra George Bosilca Innovative Computing Laboratory University of Tennessee Knoxville May 11, 2007 Contents 1 Introduction 1 2 Hardware 3 2.1 CELL Processor .................................... 3 2.1.1 POWER Processing Element (PPE) ...................... 3 2.1.2 Synergistic Processing Element (SPE) ..................... 5 2.1.3 Element Interconnection Bus (EIB) ....................... 6 2.1.4 Memory System ................................ 7 2.2 PlayStation 3 ...................................... 7 2.2.1 Network Card .................................. 7 2.2.2 Graphics Card ................................. 7 2.3 GigaBit Ethernet Switch ................................ 8 2.4 Power Consumption .................................. 8 3 Software 9 3.1 Virtualization Layer: Game OS ............................. 9 3.2 Linux Kernel ...................................... 9 3.3 Compilers ........................................ 10 ii CONTENTS CONTENTS 3.4 TCP/IP Stack ...................................... 11 3.5 MPI ........................................... 11 4 Cluster Setup 14 4.1 Basic Linux Installation ................................. 14 4.2 Linux Kernel Recompilation .............................. 16 4.3 IBM CELL SDK Installation ............................... 19 4.4 Network Configuration ................................. 22 4.5 MPI Installation ....................................
    [Show full text]
  • A Rough Guide to Scientific Computing on the Playstation 3
    SCOP3 A Rough Guide to Scientific Computing On the PlayStation 3 Technical Report UT-CS-07-595 Version 0.1 by Alfredo Buttari Piotr Luszczek Jakub Kurzak Jack Dongarra George Bosilca Innovative Computing Laboratory University of Tennessee Knoxville April 19, 2007 Contents 1 Introduction 1 2 Hardware 3 2.1 CELL Processor .................................... 3 2.1.1 POWER Processing Element (PPE) ...................... 3 2.1.2 Synergistic Processing Element (SPE) ..................... 5 2.1.3 Element Interconnection Bus (EIB) ....................... 6 2.1.4 Memory System ................................ 7 2.2 PlayStation 3 ...................................... 7 2.2.1 Network Card .................................. 7 2.2.2 Graphics Card ................................. 7 2.3 GigaBit Ethernet Switch ................................ 8 2.4 Power Consumption .................................. 8 3 Software 9 3.1 Virtualization Layer: Game OS ............................. 9 3.2 Linux Kernel ...................................... 9 3.3 Compilers ........................................ 10 ii CONTENTS CONTENTS 3.4 TCP/IP Stack ...................................... 11 3.5 MPI ........................................... 11 4 Cluster Setup 13 4.1 Basic Linux Installation ................................. 13 4.2 Linux Kernel Recompilation .............................. 15 4.3 IBM CELL SDK Installation ............................... 19 4.4 Network Configuration ................................. 21 4.5 MPI Installation ....................................
    [Show full text]
  • Parallel Programming Models
    Parallel Programming Models Dr. Mario Deilmann Intel Compiler and Languages Lab Software & Services Group, Developer Products Division Copyright © 2009, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Our Vision: Making models survive future architectures Single Source – portable across systems and into the future CPU w/Accelerator CPU SSE 4 LRBni AVX ManyCore Processors Hybrid Processors Software & Services Group, Developer Products Division Copyright © 2009, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Today’s Parallelism • There are lots of programming options TBB from Intel: Ct – Old recommendations (e.g. OpenMP) Cilk – Newer recommendations (e.g. TBB, Ct) RapidMind – New acquisitions (i.e. Cilk, RapidMind) OpenMP MPI – Competitive offerings (e.g. OpenCL, CUDA) OpenCL Xntask • An initiative (or a few initiatives) to CnC clarify the programming options for our customers Software & Services Group, Developer Products Division Copyright © 2009, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. POSIX* pthreads*: Example #include <stdio.h> #include <pthread.h> const int num_threads = 4 ; void* thread_func(void* arg) { do_work(); return NULL; Parallel processing } main() { pthread_t threads[num_threads]; for( int i = 0; i < num_threads; i++) pthread_create (&threads[i], NULL, thread_func, NULL); for(int j = 0; j < num_threads; j++) pthread_join (threads[j],
    [Show full text]
  • On the Effective Parallel Programming of Multi-Core Processors
    On the Effective Parallel Programming of Multi-core Processors PROEFSCHRIFT ter verkrijging van de graad van doctor aan de Technische Universiteit Delft, op gezag van de Rector Magnificus prof. ir. K.C.A.M. Luyben, voorzitter van het College voor Promoties, in het openbaar te verdedigen op dinsdag 7 december 2010 om 10.00 uur door Ana Lucia VARBˆ ANESCUˇ Master of Science in Architecturi Avansate de Calcul, Universitatea POLITEHNICA Bucure¸sti, Romˆania geboren te Boekarest, Roemeni¨e. Dit proefschrift is goedgekeurd door de promotor: Prof. dr. ir. H.J. Sips Samenstelling promotiecommissie: Rector Magnificus voorzitter Prof.dr.ir. H.J. Sips Technische Universiteit Delft, promotor Prof.dr.ir. A.J.C. van Gemund Technische Universiteit Delft Prof.dr.ir. H.E. Bal Vrije Universiteit Amsterdam, The Netherlands Prof.dr.ir. P.H.J. Kelly Imperial College of London, United Kingdom Prof.dr.ing. N. T¸˜apu¸s Universitatea Politehnic˜aBucure¸sti, Romania Dr. R.M. Badia Barcelona Supercomputing Center, Spain Dr. M. Perrone IBM TJ Watson Research Center, USA Prof.dr. K.G. Langendoen Technische Universiteit Delft, reservelid This work was carried out in the ASCI graduate school. Advanced School for Computing and Imaging ASCI dissertation series number 223. The work in this dissertation was performed in the context of the Scalp project, funded by STW-PROGRESS. Parts of this work have been done in collaboration and with the material support of the IBM T.J. Watson Research Center, USA. Part of this work has been done in collaboration with the Barcelona Supercomputing Center, and supported by a HPC Europa mobility grant.
    [Show full text]
  • Data-Parallel Programming with Intel Array Building Blocks (Arbb)
    Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Data-parallel programming with Intel Array Building Blocks (ArBB) Volker Weinberg * Leibniz Rechenzentrum der Bayerischen Akademie der Wissenschaften, Boltzmannstr. 1, D-85748 Garching b. München, Germany Abstract Intel Array Building Blocks is a high-level data-parallel programming environment designed to produce scalable and portable results on existing and upcoming multi- and many-core platforms. We have chosen several mathematical kernels - a dense matrix-matrix multiplication, a sparse matrix-vector multiplication, a 1-D complex FFT and a conjugate gradients solver - as synthetic benchmarks and representatives of scientific codes and ported them to ArBB. This whitepaper describes the ArBB ports and presents performance and scaling measurements on the Westmere-EX based system SuperMIG at LRZ in comparison with OpenMP and MKL. 1. Introduction Intel ArBB [1] is a combination of RapidMind and Intel Ct (“C for Throughput Computing”), a former research project started in 2007 by Intel to ease the programming of its future multi-core processors. RapidMind was a multi-core development platform which allowed the user to write portable code that was able to run on multi-core CPUs both from Intel and AMD as well as on hardware accelerators like GPGPUs from NVIDIA and AMD or the CELL processor. The platform was developed by RapidMind Inc., a company that started in 2004 based on the research related to the Sh project [2] at the University of Waterloo. Intel acquired RapidMind Inc. in August 2009 and combined the advantages of RapidMind with Intel’s Ct technology into a successor named “Intel Array Building Blocks”.
    [Show full text]
  • The Playstation 3 for High Performance Scientific Computing
    The PlayStation 3 for High Performance Scientific Computing Jakub Kurzak Department of Computer Science, University of Tennessee Alfredo Buttari French National Institute for Research in Computer Science and Control (INRIA) Piotr Luszczek MathWorks, Inc. Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee Computer Science and Mathematics Division, Oak Ridge National Laboratory School of Mathematics & School of Computer Science, University of Manchester The heart of the Sony PlayStation 3, the technological advances. STI CELL processor, was not originally in- Aside from the laws of exponential growth, a tended for scientific number crunching, and the fundamental result in Very Large Scale Integration PlayStation 3 itself was not meant primarily to (VLSI) complexity theory states that, for certain tran- serve such purposes. Yet, both these items may sitive computations (in which any output may depend impact the High Performance Computing world. on any input), the time required to perform the com- This introductory article takes a closer look at putation is reciprocal to the square root of the chip the cause of this potential disturbance. area. To decrease the time, the cross section must be increased by the same factor, and hence the total area must be increased by the square of that fac- The Multi-Core Revolution tor [3]. What this means is that instead of build- ing a chip of area A, four chips can be built of area In 1965 Gordon E. Moore, a co-founder of Intel made A/4, each of them delivering half the performance, the empirical observation that the number of tran- resulting in a doubling of the overall performance.
    [Show full text]
  • Frameworks for Multi-Core Architectures: a Comprehensive Evaluation Using 2D/3D Image Registration
    This is the author’s version of the work. The definitive work was published in Proceedings of the 24th International Conference on Architecture of Computing Systems (ARCS), Lake Como, Italy, February 22-25, 2011. Frameworks for Multi-core Architectures: A Comprehensive Evaluation using 2D/3D Image Registration Richard Membarth1, Frank Hannig1, Jürgen Teich1, Mario Körner2, and Wieland Eckert2 1 Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany. {richard.membarth,hannig,teich}@cs.fau.de 2 Siemens Healthcare Sector, H IM AX, Forchheim, Germany. {mario.koerner,wieland.eckert}@siemens.com Abstract. The development of standard processors changed in the last years moving from bigger, more complex, and faster cores to putting several more simple cores onto one chip. This changed also the way pro- grams are written in order to leverage the processing power of multiple cores of the same processor. In the beginning, programmers had to di- vide and distribute the work by hand to the available cores and to man- age threads in order to use more than one core. Today, several frame- works exist to relieve the programmer from such tasks. In this paper, we present five such frameworks for parallelization on shared memory multi-core architectures, namely OpenMP, Cilk++, Threading Building Blocks, RapidMind, and OpenCL. To evaluate these frameworks, a real world application from medical imaging is investigated, the 2D/3D im- age registration. In an empirical study, a fine-grained data parallel and a coarse-grained task parallel parallelization approach are used to evaluate and estimate different aspects like usability, performance, and overhead of each framework.
    [Show full text]
  • Software Plattform Embedded Systems 2020
    SPES Software Plattform Embedded Systems 2020 - Beschreibung der Fallstudie „Multi-core and Many-core Evaluation“ - Version: 1.0 Projektbezeichnung SPES 2020 Verantwortlich Richard Membarth QS-Verantwortlich Mario Körner, Frank Hannig Erstellt am 18.06.2010 Zuletzt geändert 18.06.2010 16:09 Freigabestatus Vertraulich für Partner Projektöffentlich X Öffentlich Bearbeitungszustand in Bearbeitung vorgelegt X fertig gestellt Weitere Produktinformationen Erzeugung Richard Membarth Mitwirkend Frank Hannig, Mario Körner, Wieland Eckert Änderungsverzeichnis Änderung Geänderte Beschreibung der Änderung Autor Zustand Kapitel Nr. Datum Version 1 22.06.10 1.0 Alle Finale Reporterstellung Prüfverzeichnis Die folgende Tabelle zeigt einen Überblick über alle Prüfungen – sowohl Eigenprüfungen wie auch Prüfungen durch eigenständige Qualitätssicherung – des vorliegenden Dokumentes. Geprüfte Neuer Datum Anmerkungen Prüfer Version Produktzustand Contents 1 Evaluation Application and Criteria 7 1.1 2D/3D Image Registration . .7 1.2 Checklist . .9 1.3 Profiling . 10 1.4 Parallelization Approaches . 12 2 Multi-Core Frameworks 15 2.1 OpenMP . 15 2.2 Cilk++ . 29 2.3 Threading Building Blocks . 43 2.4 RapidMind . 57 2.5 OpenCL . 70 2.6 Discussion . 82 3 Many-Core Frameworks 89 3.1 RapidMind . 89 3.2 PGI Accelerator . 92 3.3 OpenCL . 99 3.4 CUDA . 105 3.5 Intel Ct . 112 3.6 Larrabee . 112 3.7 Related Frameworks . 112 3.7.1 Bulk-Synchronous GPU Programming . 112 3.7.2 HMPP Workbench . 113 3.7.3 Goose . 113 3.7.4 YDEL for CUDA . 113 3.8 Discussion . 113 4 Conclusion 117 Bibliography 121 3 Abstract In this study, different parallelization frameworks for standard shared memory multi-core processors as well as parallelization frameworks for many-core processors like graphics cards are evaluated.
    [Show full text]
  • Frameworks for GPU Accelerators: a Comprehensive Evaluation Using 2D/3D Image Registration
    This is the author’s version of the work. The definitive work was published in Proceedings of the 9th IEEE Symposium on Application Specific Processors (SASP), San Diego, CA, USA, June 5-6, 2011. Frameworks for GPU Accelerators: A Comprehensive Evaluation using 2D/3D Image Registration Richard Membarth, Frank Hannig, and Jürgen Teich Mario Körner and Wieland Eckert Department of Computer Science, Siemens Healthcare Sector, H IM AX, University of Erlangen-Nuremberg, Germany. Forchheim, Germany. {richard.membarth,hannig,teich}@cs.fau.de {mario.koerner,wieland.eckert}@siemens.com for graphics cards without bothering about low-level hardware Abstract—In the last decade, there has been a dramatic growth details. in research and development of massively parallel many-core Compared to the predominant task parallel processing architectures like graphics hardware, both in academia and industry. This changed also the way programs are written in model on standard shared memory multi-core processors, order to leverage the processing power of a multitude of cores where each core processes independent tasks, a data parallel on the same hardware. In the beginning, programmers had to processing model is used for many-core processing. The main use special graphics programming interfaces to express general source for data parallelism arises out of inherent parallelism purpose computations on graphics hardware. Today, several within loops, distributing the computation of different loop frameworks exist to relieve the programmer from such tasks. In this paper, we present five frameworks for parallelization on iterations to the available processing cores. Parallelization GPU Accelerators, namely RapidMind, PGI Accelerator, HMPP frameworks assist the programmer in extracting this kind of Workbench, OpenCL, and CUDA.
    [Show full text]