Software Plattform Embedded Systems 2020

SPES Software Plattform Embedded Systems 2020

Description of the case study "Multi-core and Many-core Evaluation"

Version: 1.0
Project name: SPES 2020
Responsible: Richard Membarth
QA responsible: Mario Körner, Frank Hannig
Created on: 18.06.2010
Last modified: 18.06.2010 16:09
Release status: Confidential for partners | Project-public | X | Public
Processing state: In progress | Submitted | X | Completed

Further product information
Created by: Richard Membarth
Contributors: Frank Hannig, Mario Körner, Wieland Eckert

Change log
No. | Date | Version | Changed chapters | Description of change | Author | State
1 | 22.06.10 | 1.0 | All | Final report creation | |

Review log
The following table gives an overview of all reviews of this document, both self-reviews and reviews by independent quality assurance.
Date | Reviewed version | Remarks | Reviewer | New product state

Contents
1 Evaluation Application and Criteria
  1.1 2D/3D Image Registration
  1.2 Checklist
  1.3 Profiling
  1.4 Parallelization Approaches
2 Multi-Core Frameworks
  2.1 OpenMP
  2.2 Cilk++
  2.3 Threading Building Blocks
  2.4 RapidMind
  2.5 OpenCL
  2.6 Discussion
3 Many-Core Frameworks
  3.1 RapidMind
  3.2 PGI Accelerator
  3.3 OpenCL
  3.4 CUDA
  3.5 Intel Ct
  3.6 Larrabee
  3.7 Related Frameworks
    3.7.1 Bulk-Synchronous GPU Programming
    3.7.2 HMPP Workbench
    3.7.3 Goose
    3.7.4 YDEL for CUDA
  3.8 Discussion
4 Conclusion
Bibliography

Abstract

In this study, different parallelization frameworks for standard shared memory multi-core processors as well as parallelization frameworks for many-core processors like graphics cards are evaluated.
To evaluate the frameworks, not only performance numbers are considered, but also other criteria such as scalability, productivity, and further techniques supported by the framework, such as pipelining. For the evaluation, a computationally intensive application from medical imaging, namely 2D/3D image registration, is considered. In 2D/3D image registration, a preoperatively acquired volume is registered with an X-ray image. A 2D projection of the volume is generated and aligned with the X-ray image by translating and rotating the volume according to the three coordinate axes. This alignment step is repeated to find the best match between the projected 2D image and the X-ray image.

To evaluate the parallelization frameworks, two parallelization strategies are considered: on the one hand, fine-grained data parallelism, and on the other hand, coarse-grained task parallelism. We show that for most multi-core frameworks both strategies are applicable, whereas many-core frameworks support only fine-grained data parallelism. We compare relevant and widely used frameworks like OpenMP, Cilk++, Threading Building Blocks, RapidMind, and OpenCL for shared memory multi-core architectures. These include open-source as well as commercially available solutions. Similarly, for many-core architectures like graphics cards, we consider RapidMind, PGI Accelerator, OpenCL, and CUDA. The frameworks take different approaches to providing parallelization support for the programmer. These range from library solutions or directive-based compiler extensions to language extensions and completely new languages.

1 Evaluation Application and Criteria

Felix, qui potuit rerum cognoscere causas. (Vergil)

In this chapter, the medical application chosen for the evaluation, namely 2D/3D image registration, is introduced, as well as the criteria for the evaluation of the different frameworks.
At the end, the implications of profiling the reference implementation for the evaluation are given, as well as a description of the employed parallelization approaches.

1.1 2D/3D Image Registration

In medical settings, images of the same modality or of different modalities are often needed in order to provide precise diagnoses. However, a meaningful use of different images is only possible if the images are correctly aligned beforehand. Therefore, an image registration algorithm is deployed. In the investigated 2D/3D image registration, a previously stored volume is registered with an X-ray image [KDF+08, WPD+97]. The X-ray image results from the attenuation of the X-rays through an object on their way from the source to the detector. The goal of the registration is to align the volume and the image. To this end, the volume can be translated and rotated according to the three coordinate axes. For such a transformation, an artificial X-ray image is generated by iterating through the transformed volume and calculating the attenuated intensity for each pixel. In order to evaluate the quality of the registration, the reconstructed X-ray image is compared with the original X-ray image using various similarity measures. To obtain the best alignment of the volume with the X-ray image, the parameters for the transformation are optimized until no further improvement is achieved. In this optimization step, each evaluation generates one artificial X-ray image for the transformation parameters and compares it with the original image. The similarity measures include sequential algorithms such as the summation of values over the whole image and in parts have random memory access patterns, for example for histogram generation.

Figure 1.1: Work flow of the 2D/3D image registration.

The work flow of the complete 2D/3D image registration, as shown in Figure 1.1, consists of two major computationally intensive parts.
Firstly, a digitally reconstructed radiograph (DRR) is generated according to the parameter vector $x = (t_x, t_y, t_z, r_x, r_y, r_z)$, which describes the translation in mm along the axes and the rotation according to the Rodrigues vector [TV98]. Ray casting is used to generate the radiograph from the volume, casting one ray for each pixel through the volume. On its way through the volume, the intensity of the ray is attenuated depending on the material it passes. A detailed description with mathematical formulas of how the attenuation is calculated can be found in [KDF+08]. Secondly, intensity-based similarity measures are calculated in order to evaluate how well the digitally reconstructed radiograph and the X-ray image match. We consider three similarity measures, namely sum of squared differences (SSD), normalized cross correlation (NCC), and mutual information (MI). These similarity measures are weighted to assess the quality of the current parameter vector $x$.

To find the best alignment of the two images, two optimization techniques are used in sequence. In a first step, local search finds the best parameter vector among randomly changed parameter vectors. The changes to the parameter vector have to stay within a predefined range, so that only parameter vectors similar to the input parameter vector are evaluated. The best of these parameter vectors is the starting point for the second optimization technique, hill climbing. In hill climbing, one parameter of the vector is changed at a time: once increased by a fixed value and once decreased by the same value. This is done for all parameters, and the best resulting parameter vector is taken for the next evaluation, which uses a smaller value to change the parameters. This is repeated until no further improvement is found.
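The two-stage optimization described above can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not the reference implementation: the names toy_cost, TARGET, local_search, and hill_climb, the shrink factor, and all numeric values are hypothetical, and the quadratic cost merely stands in for the weighted similarity measure (here, lower means a better match).

```python
import random

# Hypothetical stand-in for the weighted similarity measure: lower is better.
# A real cost would render a DRR for x and compare it with the X-ray image.
TARGET = (1.0, -2.0, 0.5, 0.1, 0.0, -0.1)  # made-up (tx, ty, tz, rx, ry, rz)


def toy_cost(x):
    """Squared distance of the parameter vector x to the made-up TARGET."""
    return sum((p - t) ** 2 for p, t in zip(x, TARGET))


def local_search(cost, start, max_delta, samples, rng):
    """Evaluate randomly changed copies of `start`.

    Parameter i moves by at most +/- max_delta[i], so only vectors similar
    to `start` are tried. Returns the best vector seen (possibly `start`).
    """
    best, best_cost = list(start), cost(start)
    for _ in range(samples):
        cand = [p + rng.uniform(-d, d) for p, d in zip(start, max_delta)]
        c = cost(cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost


def hill_climb(cost, start, step, shrink):
    """Change one parameter at a time by +step and -step.

    The best candidate of a round becomes the new vector and the step is
    multiplied by `shrink` (assumed factor; the text only says "smaller").
    Stops as soon as a round yields no improvement.
    """
    best, best_cost = list(start), cost(start)
    while True:
        round_best, round_cost = None, best_cost
        for i in range(len(best)):
            for delta in (step, -step):
                cand = list(best)
                cand[i] += delta
                c = cost(cand)
                if c < round_cost:
                    round_best, round_cost = cand, c
        if round_best is None:
            return best, best_cost
        best, best_cost = round_best, round_cost
        step *= shrink


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    x0 = [0.0] * 6
    x1, c1 = local_search(toy_cost, x0, max_delta=[2.0] * 6, samples=50, rng=rng)
    x2, c2 = hill_climb(toy_cost, x1, step=1.0, shrink=0.5)
    print("start cost:", toy_cost(x0))
    print("after local search:", c1)
    print("after hill climbing:", c2)
```

Since the step shrinks after every round and only one parameter changes per round, the shrink factor bounds how far a parameter can still travel; the value 0.5 used here is an assumption and would need tuning for a real registration.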

1.2 Checklist

The following criteria have been chosen as relevant for the evaluation of the multi-core and many-core frameworks. The criteria should be evaluated for each of the frameworks and, afterwards, a summary of the evaluation of all frameworks should be given.

• Time required to get familiar with a particular framework: This includes the time to understand how the framework works and how it can be utilized for parallelization, as well as the time spent experimenting with some small examples.
• Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added: This item describes the effort that has to be invested in order to map the reference code to the new framework. This corresponds mainly to the time required to rewrite parts of the source code, but also to the effort to express sequential algorithms as parallel counterparts.
• Support of parallelization by the framework: To what extent does the framework support the programmer in parallelizing a program, for example, through abstraction of available resources (cores), management of those resources, or support of auto-parallelization?
• Support of data partitioning by the framework: Data is often tiled and processed by different threads and cores, respectively. Does the framework partition data automatically, or does this have to be done by hand?
• Applicability for different algorithm classes and target platforms: Which types of parallel algorithms can be expressed? We consider in particular the support of task parallelism and data parallelism.
• Advantages, drawbacks, and difficulties of a particular framework: Here, the unique properties that distinguish the framework from other frameworks are listed, but also problems encountered while employing the framework.
• Effort to get a certain result, for example, performance or throughput: How much effort has to be spent in order to achieve a given result? For instance, this may be a speedup of 3x on a system with four cores.
• Scalability of the framework with respect to