Exploring the Multiple-GPU Design Space


Dana Schaa and David Kaeli
Department of Electrical and Computer Engineering, Northeastern University
{dschaa, kaeli}@ece.neu.edu

Abstract

Graphics Processing Units (GPUs) have been growing in popularity due to their impressive processing capabilities, and with general purpose programming languages such as NVIDIA's CUDA interface, are becoming the platform of choice in the scientific computing community. Previous studies that used GPUs focused on obtaining significant performance gains from execution on a single GPU. These studies employed low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution.

In this paper, we consider the benefits of running on multiple (parallel) GPUs to provide further orders of performance speedup. Our methodology allows developers to accurately predict execution time for GPU applications while varying the number and configuration of the GPUs, and the size of the input data set. This is a natural next step in GPU computing because it allows researchers to determine the most appropriate GPU configuration for an application without having to purchase hardware, or write the code for a multiple-GPU implementation. When used to predict performance on six scientific applications, our framework produces accurate performance estimates (11% difference on average and 40% maximum difference in a single case) for a range of short and long running scientific programs.

1. Introduction

The benefits of using Graphics Processing Units (GPUs) for general purpose programming have been recognized for some time, and many general purpose APIs have been created to abstract the graphics hardware from the programmer [2], [10], [15]. General purpose GPU (GPGPU) programming has become the scientific computing platform of choice mainly due to the availability of standard C libraries using NVIDIA's CUDA programming interface running on NVIDIA GTX GPUs. Since its first release, a number of efforts have explored how to reap large performance gains on CUDA-enabled GPUs [3], [7], [16], [18], [20], [21], [23]. One reason for this rapid adoption is that current generation GPUs (theoretically approaching a TFLOP of computational power) have the potential to replace a large number of superscalar CPUs for certain classes of parallel applications. However, obtaining peak GPU performance requires significant programming effort. For example, when running the matrix multiplication function in CUDA's BLAS library on a GeForce 8800 GTX (a theoretical peak of 368 GFLOPS), we measured a peak execution rate of around 40 GFLOPS. To begin to obtain better performance, researchers would hand-tune their code to match the characteristics of the underlying GPU hardware. Although tuning may be required in some cases to achieve high performance (such as aligning data for efficient reads in memory-bound applications), these types of optimizations require hardware-specific knowledge and make the code less portable between generations of devices. Similarly, previous work has shown that the optimal data layout varies across different CUDA software releases [20].

Instead of focusing on fine-tuning applications for hardware, our approach is to utilize multiple GPUs to exploit even larger degrees of parallelism. To show that execution on multiple GPUs is beneficial, we introduce models for the various components of GPU execution and provide a methodology for predicting execution of GPU applications. Our methodology is designed to accurately predict the execution time of a given CUDA application (based on a single-GPU implementation), while varying the number of GPUs, their configuration, and the data set size of the application. Using this technique, we can generate accurate performance estimates that allow users to determine the system and GPU configuration that best fits their cost-performance needs.

Execution on parallel GPUs is promising because applications that are best suited to run on GPUs inherently have large amounts of parallelism. Utilizing multiple GPUs avoids dealing with many of the limitations of GPUs (e.g., on-chip memory resources) by exploiting the combined resources of multiple boards. Of course, inter-GPU communication becomes a new problem that we need to address, and involves considering the efficiency of the current communication fabric provided on GPUs. Our resulting framework is both effective and accurate in capturing the dynamics present as we move from a single GPU to multiple GPUs. With our work, GPU developers can determine potential speedups gained from execution of their applications on any number of GPUs without having to purchase expensive hardware or even write code to simulate a parallel implementation.

The remainder of this paper is organized as follows. Section 2 presents previous GPGPU performance studies, including some works that utilize multiple GPUs. Section 3 is a brief overview of the CUDA programming model. Since the factors affecting multiple-GPU execution have not been discussed in previous work, in Section 4 we provide background on the multiple-GPU design space and also describe our methodology. Section 5 discusses the applications used to illustrate the power of our methodology. Results are presented in Section 6. Finally, Section 7 concludes the paper.

2. Related Work

Using GPUs for general purpose scientific computing has allowed a range of challenging problems to be solved faster and has enabled researchers to study larger (e.g., finer-grained) data sets [3], [7], [16], [18], [23]. In [21], Ryoo et al. provide a nice overview of the GeForce 8800 GTX architecture and also present strategies for optimizing performance. We build upon some of the ideas presented in these prior papers by exploiting additional degrees of available parallelism with the mapping of applications to multiple GPUs. Our work shows that there can be great benefits when pursuing this path.

2.1. Parallel GPUs

Even though most prior work has focused on algorithm mapping techniques specific to performance on a single GPU, there are a number of efforts studying how to exploit larger numbers of GPUs to accelerate specific problems. The Visualization Lab at Stony Brook University has a 66-node cluster that contains GeForce FX5800 Ultra and Quadro FX4500 graphics cards that are used for both visualization and computation. Parallel algorithms implemented on the cluster include medical reconstruction, particle flow, and dispersion algorithms [6], [17]. This prior work targets effective usage of a networked group of distributed systems, each containing a single GPU.

Moerschell and Owens describe the implementation of a distributed shared memory system to simplify execution on multiple distributed GPUs [11]. In their work, they formulate a memory consistency model to handle inter-GPU memory requests. By their own admission, the shortcoming of their approach is that memory requests have a large impact on performance, and any abstraction where the programmer does not know where data is stored (i.e., on or off the GPU) is impractical for current GPGPU implementations. Still, as GPUs begin to incorporate Scalable Link Interface (SLI) technology for boards connected to the same system, this technique may prove promising.

In a related paper, Fan et al. explore how to utilize distributed GPU memories using object-oriented libraries [5]. For one particular application, they were able to decrease the code size for a Lattice-Boltzmann model from 2800 lines to 100 lines while maintaining identical performance.

Caravela [24] is a stream-based computing model that incorporates GPUs into GRID computing and uses them to replace CPUs as computation devices. While the GRID is not an ideal environment for coordinated GPGPU execution (due to the communication overheads we will discuss in Section 4), this model may have a niche for long-executing applications with large memory requirements, as well as for researchers who wish to run many batch GPU-based jobs.

Despite the growing interest in exploiting multiple GPUs, and previous work demonstrating that performance gains are possible for specific applications, an in-depth analysis of the factors affecting multiple-GPU execution is lacking. The main contribution of our work is the creation of models representing the significant aspects of parallel-GPU execution (especially communication). These models can be used to determine the best system configuration to deliver the required amount of performance for any application, while taking into account factors such as hardware specifications and input data size. Our work should accelerate the move to utilizing multiple GPUs in GPGPU computing.

3. Parallel Computing with CUDA

Next, we provide a brief overview of the CUDA programming model as relevant to this work. We also discuss issues with CUDA related to distributed and shared-system GPU computing. Those seeking more details on CUDA programming should refer to the CUDA tutorials provided by Luebke et al. [9]. To facilitate a discussion of the issues involved in utilizing multiple GPUs, we begin by formalizing some of the terminology used in this work:

• Distributed GPUs - A networked group of distributed systems, each containing a single GPU.
• Shared-system GPUs - A single system containing multiple GPUs that communicate through shared CPU RAM (such as the NVIDIA Tesla S870 server [13]).
• GPU-Parallel Execution - Execution that takes place across multiple GPUs in parallel (as opposed to parallel execution on a single GPU). This term is inclusive of both distributed and shared-system GPU execution.

3.1. The CUDA Programming Model

CUDA terminology refers to a GPU as the device and a CPU as the host. These terms are used in the same manner for the remainder of this paper. Next, we summarize CUDA's threading and memory models.

3.1.1. Highly Threaded Environment. CUDA supports a large number of active threads and uses single-cycle context switches to hide datapath and memory-access latencies. When running on NVIDIA's G80 Series GPUs, threads are managed across 16 multiprocessors, each consisting of 8 single-instruction-multiple-data (SIMD) cores.

3.2. GPU-Parallel Execution
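The highly threaded model summarized in Section 3.1.1 is easiest to see in code. The sketch below is our own minimal illustration, not code from the paper: it launches far more threads than the 128 SIMD cores of a G80-class part, so the multiprocessors can swap warps to hide memory latency. The kernel, vector size, and block size are all illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Each thread scales one element. Many thousands of threads are launched so the
// multiprocessors can switch among warps to hide memory latency.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

int main() {
    const int n = 1 << 20;                          // 1M elements (illustrative size)
    float *h = (float *)malloc(n * sizeof(float));  // "host" buffer, in CUDA terms
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = nullptr;                             // "device" buffer
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // 256 threads per block; enough blocks to cover all n elements.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %.1f\n", h[0]);                  // expect 2.0
    cudaFree(d);
    free(h);
    return 0;
}
```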
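The terminology above separates distributed GPUs from shared-system GPUs. As a hedged sketch of shared-system GPU-parallel execution (again ours, not the paper's framework), the host below splits one array across every visible device with cudaSetDevice, runs the same kernel on each chunk, and gathers the results back through shared CPU RAM, which is the communication path whose cost the paper's models aim to capture.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Same illustrative kernel as in the previous sketch.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

int main() {
    const int n = 1 << 22;
    std::vector<float> h(n, 1.0f);

    int numGpus = 0;
    cudaGetDeviceCount(&numGpus);        // shared-system GPUs visible to this host
    if (numGpus == 0) return 1;
    const int chunk = (n + numGpus - 1) / numGpus;

    std::vector<float *> d(numGpus, nullptr);
    for (int g = 0; g < numGpus; ++g) {  // scatter: one chunk per board
        const int count = std::min(chunk, n - g * chunk);
        if (count <= 0) continue;
        cudaSetDevice(g);                // subsequent calls target device g
        cudaMalloc(&d[g], count * sizeof(float));
        cudaMemcpy(d[g], h.data() + g * chunk, count * sizeof(float),
                   cudaMemcpyHostToDevice);
        // Kernel launches return immediately, so the devices compute concurrently.
        scale<<<(count + 255) / 256, 256>>>(d[g], 2.0f, count);
    }
    for (int g = 0; g < numGpus; ++g) {  // gather results back through host RAM
        if (d[g] == nullptr) continue;
        const int count = std::min(chunk, n - g * chunk);
        cudaSetDevice(g);
        cudaMemcpy(h.data() + g * chunk, d[g], count * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d[g]);
    }
    return 0;
}
```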
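The paper's stated goal is to predict execution time while varying the number of GPUs, their configuration, and the data set size; the actual models are not reproduced in this excerpt. The toy estimator below only suggests the general shape of such a prediction (ideal compute scaling plus a host-device transfer term), and every parameter value in it is our assumption rather than a number from the paper.

```cuda
#include <cstdio>

// Hypothetical first-order estimate, not a model from the paper: kernel time
// measured on one GPU is divided across N GPUs, and each GPU pays to move its
// share of the data over the host<->device fabric. Assumes transfers to
// different boards can overlap, which a shared bus may not actually allow.
double predictSeconds(double singleGpuKernelSec,   // kernel time measured on 1 GPU
                      double bytesMoved,           // total host<->device traffic
                      int numGpus,
                      double fabricBytesPerSec) {  // e.g., PCIe bandwidth
    const double compute  = singleGpuKernelSec / numGpus;               // ideal split
    const double transfer = (bytesMoved / numGpus) / fabricBytesPerSec; // per-GPU copies
    return compute + transfer;
}

int main() {
    // Example: a 10 s single-GPU kernel moving 1 GiB over a ~3 GB/s fabric.
    for (int n = 1; n <= 8; n *= 2)
        printf("%d GPU(s): ~%.2f s predicted\n", n,
               predictSeconds(10.0, 1024.0 * 1024.0 * 1024.0, n, 3.0e9));
    return 0;
}
```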