CUDA C Programming Guide

CUDA C PROGRAMMING GUIDE
PG-02829-001_v5.0 | October 2012
Design Guide

CHANGES FROM VERSION 4.2

‣ Updated Texture Memory and Texture Functions with the new texture object API.
‣ Updated Surface Memory and Surface Functions with the new surface object API.
‣ Updated Concurrent Kernel Execution, Implicit Synchronization, and Overlapping Behavior for devices of compute capability 3.5.
‣ Removed from Table 2 Throughput of Native Arithmetic Instructions the possible code optimization of omitting __syncthreads() when synchronizing threads within a warp.
‣ Removed Synchronization Instruction for devices of compute capability 3.5.
‣ Updated __global__ to mention that __global__ functions are callable from the device for devices of compute capability 3.x (see the CUDA Dynamic Parallelism programming guide for more details).
‣ Mentioned in Device Memory Qualifiers that __device__, __shared__, and __constant__ variables can be declared as external variables when compiling in the separate compilation mode (see the nvcc user manual for a description of this mode).
‣ Mentioned memcpy and memset in Dynamic Global Memory Allocation and Operations.
‣ Added new functions sincospi(), sincospif(), normcdf(), normcdfinv(), normcdff(), and normcdfinvf() in Standard Functions.
‣ Updated the maximum ULP error for erfcinvf(), sin(), sinpi(), cos(), cospi(), and sincos() in Standard Functions.
‣ Added new intrinsic __frsqrt_rn() in Intrinsic Functions.
‣ Added new Section Callbacks on stream callbacks.

TABLE OF CONTENTS

Chapter 1. Introduction
  1.1 From Graphics Processing to General Purpose Parallel Computing
  1.2 CUDA™: A General-Purpose Parallel Computing Platform and Programming Model
  1.3 A Scalable Programming Model
  1.4 Document Structure
Chapter 2. Programming Model
  2.1 Kernels
  2.2 Thread Hierarchy
  2.3 Memory Hierarchy
  2.4 Heterogeneous Programming
  2.5 Compute Capability
Chapter 3. Programming Interface
  3.1 Compilation with NVCC
    3.1.1 Compilation Workflow
      3.1.1.1 Offline Compilation
      3.1.1.2 Just-in-Time Compilation
    3.1.2 Binary Compatibility
    3.1.3 PTX Compatibility
    3.1.4 Application Compatibility
    3.1.5 C/C++ Compatibility
    3.1.6 64-Bit Compatibility
  3.2 CUDA C Runtime
    3.2.1 Initialization
    3.2.2 Device Memory
    3.2.3 Shared Memory
    3.2.4 Page-Locked Host Memory
      3.2.4.1 Portable Memory
      3.2.4.2 Write-Combining Memory
      3.2.4.3 Mapped Memory
    3.2.5 Asynchronous Concurrent Execution
      3.2.5.1 Concurrent Execution between Host and Device
      3.2.5.2 Overlap of Data Transfer and Kernel Execution
      3.2.5.3 Concurrent Kernel Execution
      3.2.5.4 Concurrent Data Transfers
      3.2.5.5 Streams
      3.2.5.6 Events
      3.2.5.7 Synchronous Calls
    3.2.6 Multi-Device System
      3.2.6.1 Device Enumeration
      3.2.6.2 Device Selection
      3.2.6.3 Stream and Event Behavior
      3.2.6.4 Peer-to-Peer Memory Access
      3.2.6.5 Peer-to-Peer Memory Copy
    3.2.7 Unified Virtual Address Space
    3.2.8 Error Checking
    3.2.9 Call Stack
    3.2.10 Texture and Surface Memory
      3.2.10.1 Texture Memory
      3.2.10.2 Surface Memory
      3.2.10.3 CUDA Arrays
      3.2.10.4 Read/Write Coherency
    3.2.11 Graphics Interoperability
      3.2.11.1 OpenGL Interoperability
      3.2.11.2 Direct3D Interoperability
      3.2.11.3 SLI Interoperability
  3.3 Versioning and Compatibility
  3.4 Compute Modes
  3.5 Mode Switches
  3.6 Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
  4.1 SIMT Architecture
  4.2 Hardware Multithreading
Chapter 5. Performance Guidelines
  5.1 Overall Performance Optimization Strategies
  5.2 Maximize Utilization
    5.2.1 Application Level
    5.2.2 Device Level
    5.2.3 Multiprocessor Level
  5.3 Maximize Memory Throughput
    5.3.1 Data Transfer between Host and Device
    5.3.2 Device Memory Accesses
  5.4 Maximize Instruction Throughput
    5.4.1 Arithmetic Instructions
    5.4.2 Control Flow Instructions
    5.4.3 Synchronization Instruction