OPTIMIZING CUDA APPLICATIONS FOR THE VOLTA/TURING ARCHITECTURE
Vishal Mehta, Maxim Milakov, NVIDIA, Oct 18, 2018


NEW FEATURES IN CUDA ECOSYSTEM

TURING AND NEW SYSTEMS: new GPU architecture, Tensor Cores, NVSwitch fabric, DGX-2, RT Core
CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 interop, Warp Matrix Multiply Accumulate (WMMA)
LIBRARIES: GPU-accelerated hybrid JPEG decoding, symmetric eigenvalue solvers, FFT scaling for scientific computing
DEVELOPER TOOLS: new Nsight products – Nsight Systems and Nsight Compute

AGENDA
New features: Tensor Cores, RT Core, CUDA Graphs, Nsight developer tools
Optimization strategies: Volta/Turing execution model, Volta/Turing memory subsystem

TENSOR CORES

VOLTA / TURING SM

    Per SM                Volta (V100)   Turing (TU102)
    FP64 units            32             2
    INT32 units           64             64
    FP32 units            64             64
    Tensor Cores          8              8
    RT Core               -              1
    Register file         256 KB         256 KB
    L1 and shmem          128 KB         96 KB
    Max threads           2048           1024
    Compute capability    70             75*

    * Volta (cc70) code runs on Turing without JIT or recompile!

TENSOR CORES
New in Volta, extended in Turing

    GPU     SMs   Tensor Cores   Peak Half FLOPS   Peak INT8 OPS   Peak INT4 OPS   Peak Binary OPS
    V100    80    640            125 TFLOPS        N.A.            N.A.            N.A.
    TU102   72    576            130.5 TFLOPS      261 TOPS        522 TOPS        2088 TOPS

Operating modes:
▪ half-precision inputs, half / float accumulator
▪ 8-bit / 4-bit INT inputs, 32-bit INT accumulator
▪ 1-bit binary inputs, 32-bit INT accumulator (XOR + POPC)

Used via cuBLAS, cuDNN, CUTLASS, TensorRT. Exposed in CUDA 10 (4-bit INT and 1-bit binary are experimental).

TURING TENSOR CORE
New warp matrix functions

WMMA operations now include 8-bit integer along with FP16:
▪ Warp Matrix Multiply Accumulate: D = A * B + C
▪ Signed & unsigned 8-bit input
▪ 32-bit integer accumulator
▪ Input/output dimensions similar to FP16
▪ 2048 ops per cycle, per SM for 8-bit
▪ Exposed through nvcuda::wmma

Supported tile shapes:
    WMMA 16x16x16: D (16x16) = A (16x16) * B (16x16) + C (16x16)
    WMMA 32x8x16:  D (32x8)  = A (32x16) * B (16x8)  + C (32x8)
    WMMA 8x32x16:  D (8x32)  = A (8x16)  * B (16x32) + C (8x32)

EXPERIMENTAL WARP MATRIX FUNCTIONS
Turing enables experimental sub-byte Tensor Core operations

Experimental sub-byte operations:
▪ 4-bit signed & unsigned input
▪ 1-bit input with custom matrix operations
▪ 32-bit accumulator output

Accessed via a special namespace, nvcuda::wmma::experimental:

    namespace experimental {
        namespace precision {
            struct u4; // 4-bit unsigned
            struct s4; // 4-bit signed
            struct b1; // 1-bit
        }
        enum bmmaBitOp        { bmmaBitOpXOR = 1 };
        enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
    }

These enable researchers to experiment with ultra-low precision. "Experimental" means the API is subject to change, not the functionality.

WMMA – IMMA 4BIT
New for Turing (experimental)

    D (8x8, int32) = A (8x32, 4-bit) * B (32x8, 4-bit) + C (8x8, int32)
    D(i,j) = sum over k of A(i,k) * B(k,j) + C(i,j),  k = 0 .. 31
    (each row of A and each column of B is 128 bits)

WMMA – BINARY – XOR POPC
New for Turing (experimental)

    D (8x8, int32) = A (8x128, 1-bit) op B (128x8, 1-bit) + C (8x8, int32)
    D(i,j) = sum over k of popc(A(i,k) ^ B(k,j)) + C(i,j),  k = 0 .. 127
    (each row of A and each column of B is 128 bits)

BINARY TENSOR CORE OPERATION
Per output point, the 1-bit input signals are combined with a bitwise XOR, a 128-bit population count is added to the 32-bit integer accumulator, and the result is accumulated with the previous row/column results to produce the 32-bit integer output.
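All of these WMMA variants are driven through the same fragment / load / mma / store pattern in nvcuda::wmma (the experimental sub-byte shapes follow the same structure through nvcuda::wmma::experimental). As a rough sketch that is not part of the deck, a single-tile 16x16x16 FP16 multiply with a float accumulator could look like the following; the kernel name, pointer arguments, and single-tile row-major layout are assumptions for illustration:

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes one 16x16 tile: d = a * b + c.
    // a, b are FP16 inputs; c, d are FP32 accumulators, all row-major with leading dimension 16.
    __global__ void wmma_fp16_tile(const half *a, const half *b, const float *c, float *d) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::load_matrix_sync(a_frag, a, 16);                        // collaborative load by the warp
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // Tensor Core matrix multiply-accumulate

        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }

Launched with a single warp (for example <<<1, 32>>>), this computes one tile; production kernels tile larger matrices across many warps, which is what cuBLAS and CUTLASS do internally.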
NEW TURING WARP MATRIX FUNCTIONS

    Input precision                          Output            Supported sizes              Max Ops/Clock/SM
    half*                                    half or float     16x16x16, 32x8x16, 8x32x16   1024    (native)
    char, unsigned char                      integer (int32)   16x16x16, 32x8x16, 8x32x16   2048    (native)
    precision::u4 (4-bit unsigned),
    precision::s4 (4-bit signed)             integer (int32)   8x8x32                       4096    (experimental)
    precision::b1 (1-bit)                    integer (int32)   8x8x128                      16384   (experimental)

    * Also available on Volta sm_70.
    Note: WMMA code requires recompilation for sm_75 to reach peak performance.

CUTLASS 1.1
High-performance matrix multiplication in open-source templated CUDA C++
CUTLASS GEMM structural model (figure).
▪ Turing-optimized GEMMs
▪ Integer (8-bit, 4-bit and 1-bit) GEMMs using WMMA
▪ Batched strided GEMM
▪ Support for CUDA 10.0
▪ Updates to documentation and more examples
On Volta (GV100), CUTLASS 1.1 reaches more than 90% of peak performance across the DGEMM, HGEMM, IGEMM, SGEMM and WMMA (F16 and F32 accumulation) GEMM variants.
https://github.com/NVIDIA/cutlass

TURING RTCORE

RT CORES
Turing GPU RT Cores accelerate ray tracing.

RT Cores perform:
● Ray-BVH (Bounding Volume Hierarchy) traversal
● Instancing: 1 level
● Ray-triangle intersection

Return to the SM for:
● Multi-level instancing
● Custom intersection
● Shading

SOFTWARE VS. HARDWARE RAY TRACING
Pre-Turing, BVH traversal and intersection tests run in software on the SM; on Turing, the RT Core performs the traversal and triangle intersection and returns results to the SM for shading and custom (non-triangle) intersections.

RT CORE IN OPTIX
• Single-ray shader programming model using C++
• Transparently scales across multiple GPUs
• AI-accelerated rendering
• Easy interop with CUDA
http://developer.nvidia.com/optix
http://on-demand.gputechconf.com

CUDA GRAPHS

ASYNCHRONOUS TASK GRAPHS
Execution optimization when the workflow is known up-front. Typical cases: deep neural network training, DL inference, loop & function offload, linear algebra, HPC simulation.

ALL CUDA WORK FORMS A GRAPH
▪ A node represents an operation
▪ An edge represents a dependency
▪ Any CUDA stream can be mapped to a graph
Work submitted to CUDA streams expresses its dependencies implicitly (stream order, events, waits); a graph expresses the same dependencies explicitly.
DEFINITION OF A CUDA GRAPH
Graph nodes are not just kernel launches.

A graph is a sequence of operations, connected by dependencies. Operations are one of:
    Kernel launch       CUDA kernel running on the GPU
    CPU function call   Callback function on the CPU
    Memcopy / memset    GPU data management
    Sub-graph           Graphs are hierarchical

NEW EXECUTION MECHANISM
Graphs can be generated once, then launched repeatedly:

    for (int i = 0; i < 1000; i++) {
        launch_graph( G );
    }

EXECUTION OPTIMIZATIONS
Latency & overhead reductions

Launch latencies:
▪ CUDA 10.0 takes at least 2.2 us of CPU time to launch each CUDA kernel on Linux
▪ A pre-defined graph allows any number of kernels to be launched in one single operation
(Timeline: launching kernels A..E individually keeps the CPU busy with five launches; building a graph and launching it once submits A..E together and frees the CPU sooner.)

PERFORMANCE IMPACT
Optimizations for short-runtime operations

CPU launch time improvements: typically 33% faster than stream launch.
Example: small 3D FFT. 25% end-to-end improvement for a 32^3 3D-FFT (16 us with stream launch, 12 us with graph launch).
NOTE: the performance impact is workload-dependent. Graphs especially benefit short-running kernels, where launch overheads account for a larger share of the runtime.

THREE-STAGE EXECUTION MODEL
Define, instantiate, execute:
1. Define: a single graph "template", created in host code, loaded from disk, or built up from libraries (create once, run many times).
2. Instantiate: an executable graph is a snapshot of the template; instantiation sets up and initializes the GPU execution structures. Multiple executable graphs can be created from one template.
3. Execute: executable graphs run in CUDA streams; concurrency in the graph is not limited by the stream (see later).

CONVERT CUDA STREAM INTO A GRAPH
Construct a graph from normal CUDA stream syntax. Capture follows inter-stream dependencies (events) to create forks & joins in the graph.

    // Start by initiating stream capture
    cudaStreamBeginCapture(stream1);

    // Build stream work as usual
    A<<< ..., stream1 >>>();
    cudaEventRecord(e1, stream1);
    B<<< ..., stream1 >>>();
    cudaStreamWaitEvent(stream2, e1);
    C<<< ..., stream2 >>>();
    cudaEventRecord(e2, stream2);
    cudaStreamWaitEvent(stream1, e2);
    D<<< ..., stream1 >>>();

    // Now convert the stream to a graph
    cudaStreamEndCapture(stream1, &graph);

CREATE GRAPHS DIRECTLY
Map graph-based workflows directly into CUDA:

    // Define graph of work + dependencies
    cudaGraphCreate(&graph);
    cudaGraphAddNode(graph, kernel_a, {}, ...);
    cudaGraphAddNode(graph, kernel_b, { kernel_a }, ...);
    cudaGraphAddNode(graph, kernel_c, { kernel_a }, ...);
    cudaGraphAddNode(graph, kernel_d, { kernel_b, kernel_c }, ...);

    // Instantiate graph and apply optimizations
    cudaGraphInstantiate(&instance, graph);

    // Launch executable graph 100 times
    for (int i = 0; i < 100; i++)
        cudaGraphLaunch(instance, stream);

(The cudaGraphAddNode calls above are schematic; the runtime API exposes typed variants such as cudaGraphAddKernelNode.)
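To make the three stages concrete, here is a small self-contained sketch that is not from the deck: it captures two placeholder kernels into a graph, instantiates the executable graph once, and relaunches it many times. The kernel names and the run() wrapper are assumptions for illustration, and error checking is omitted.

    #include <cuda_runtime.h>

    __global__ void kernelA(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] += 1.0f; }
    __global__ void kernelB(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }

    void run(float *d_x, int n, cudaStream_t stream) {
        cudaGraph_t graph;
        cudaGraphExec_t graphExec;

        // Define: record the stream work into a graph template instead of executing it.
        // (CUDA 10.0 form; newer toolkits add a capture-mode argument, e.g. cudaStreamCaptureModeGlobal.)
        cudaStreamBeginCapture(stream);
        kernelA<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
        kernelB<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
        cudaStreamEndCapture(stream, &graph);

        // Instantiate: snapshot the template into an executable graph (done once).
        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

        // Execute: a single launch call replays the whole graph each iteration.
        for (int i = 0; i < 1000; i++)
            cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
    }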
GRAPH EXECUTION SEMANTICS
Order graph work with other, non-graph CUDA work by launching everything into the same stream (a more concrete sketch follows at the end of this excerpt):

    void launchWork(cudaGraphExec_t i1, cudaGraphExec_t i2, CPU_Func cpu, cudaStream_t stream)
    {
        A <<< 256, 256, 0, stream >>>();      // Kernel launch
        cudaGraphLaunch(i1, stream);          // Graph1 launch
        cudaStreamAddCallback(stream, cpu);   // CPU callback
        cudaGraphLaunch(i2, stream);          // Graph2 launch
        cudaStreamSynchronize(stream);
    }

If you can put it in a CUDA stream, you can run it together with a graph.

GRAPHS IGNORE STREAM SERIALIZATION RULES
The launch stream is used only for ordering with respect to other work. Branches in the graph still execute concurrently even though the graph is launched into a stream.

CROSS-DEVICE DEPENDENCIES
Graphs may span multiple GPUs: multi-device execution, heterogeneous execution. CUDA is closest to the O/S and the hardware:
▪ Can optimize
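Returning to the launchWork example above, here is a more concrete, self-contained sketch that is not from the deck: it mixes a kernel launch, two executable graphs, and a CPU callback in one stream. The kernel, callback, and argument names are assumptions, and the callback uses the full cudaStreamAddCallback signature (callback, user data, flags).

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void preprocess(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] = (float)i; }

    // Host-side callback: runs on a CPU thread once all prior work in the stream has completed.
    void CUDART_CB onCheckpoint(cudaStream_t stream, cudaError_t status, void *userData) {
        printf("checkpoint reached, status = %d\n", (int)status);
    }

    // Mix a kernel launch, two executable graphs, and a CPU callback in a single stream.
    void launchWork(cudaGraphExec_t g1, cudaGraphExec_t g2, float *d_x, int n, cudaStream_t stream) {
        preprocess<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);  // kernel launch
        cudaGraphLaunch(g1, stream);                              // graph 1 launch
        cudaStreamAddCallback(stream, onCheckpoint, nullptr, 0);  // CPU callback
        cudaGraphLaunch(g2, stream);                              // graph 2 launch
        cudaStreamSynchronize(stream);
    }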