How a GPU Works
Kayvon Fatahalian, 15-462 (Fall 2011)


Today
  1. Review: the graphics pipeline
  2. History: a few old GPUs
  3. How a modern GPU works (and why it is so fast!)
  4. Closer look at a real GPU design: NVIDIA GTX 285

Part 1: The graphics pipeline (an abstraction)

Vertex processing
Vertices are transformed into "screen space". Each vertex is transformed independently.

Primitive processing
The vertices are then organized into primitives (triangles) that are clipped and culled.

Rasterization
Primitives are rasterized into "pixel fragments". Each primitive is rasterized independently.

Fragment processing
Fragments are shaded to compute a color at each pixel. Each fragment is processed independently.

Pixel operations
Fragments are blended into the frame buffer at their pixel locations (the z-buffer determines visibility).

Pipeline entities
Vertices -> Primitives -> Fragments -> Fragments (shaded) -> Pixels

Graphics pipeline
  Stage                  Type             Output stream            Memory buffers
  Vertex Generation      Fixed-function   Vertex stream            Vertex data buffers
  Vertex Processing      Programmable     Vertex stream            Textures
  Primitive Generation   Fixed-function   Primitive stream
  Primitive Processing   Programmable     Primitive stream         Textures
  Fragment Generation    Fixed-function   Fragment stream
  Fragment Processing    Programmable     Fragment stream          Textures
  Pixel Operations       Fixed-function   Output image (pixels)

Part 2: Graphics architectures (implementations of the graphics pipeline)

Independent
  • What's so important about "independent" computations?
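One way to see why independence matters: if processing one vertex never reads the result of another, the work can be done in any order, or on many workers at once, and the output is the same. The sketch below is illustrative only. The function name `to_screen_space`, the viewport size, and the simple mapping from [-1, 1] coordinates to pixels are assumptions made for the example, not part of the lecture.

```python
# Hypothetical sketch: per-vertex work does not depend on other vertices,
# so sequential, parallel, and reordered execution all agree.
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 640, 480  # assumed viewport size, for illustration only


def to_screen_space(vertex):
    """Map one vertex from [-1, 1] coordinates to pixel coordinates."""
    x, y = vertex
    return ((x + 1.0) * 0.5 * WIDTH, (1.0 - y) * 0.5 * HEIGHT)


vertices = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (0.5, -0.5)]

# Sequential: one vertex after another.
sequential = [to_screen_space(v) for v in vertices]

# Parallel: the same function applied by a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(to_screen_space, vertices))

# Processed in reverse order, then put back in the original order.
reordered = [to_screen_space(v) for v in reversed(vertices)][::-1]

print(sequential == parallel == reordered)
```

The same argument applies to primitives and fragments: because no element waits on another, hardware is free to spread them across as many execution units as it has.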
Silicon Graphics RealityEngine (1993)
"Graphics supercomputer"
[Diagram: the seven pipeline stages, Vertex Generation through Pixel Operations]

Pre-1999 PC 3D graphics accelerator
Examples: 3dfx Voodoo, NVIDIA RIVA TNT
[Diagram: pipeline stages split between the CPU and the accelerator; the accelerator contains Clip/cull/rasterize, Tex units, and Pixel operations]

GPU* circa 1999
Example: NVIDIA GeForce 256
[Diagram: pipeline stages split between the CPU and the GPU]

Direct3D 9 programmability: 2002
Example: ATI Radeon 9700
[Diagram: 4 Vtx units, Clip/cull/rasterize, 8 Frag units each paired with a Tex unit, Pixel operations]

Direct3D 10 programmability: 2006
Example: NVIDIA GeForce 8800 ("unified shading" GPU)
[Diagram: 16 Cores, Tex, 6 Pixel op units, Clip/Cull/Rast, Scheduler]

Part 3: How a shader core works (three key ideas)

GPUs are fast
                   Intel Core i7 Quad Core    AMD Radeon HD 5870
  Peak             ~100 GFLOPS                ~2.7 TFLOPS
  Transistors      730 million                2.2 billion
  Obtainable if    you code your program      you write OpenGL programs
                   to use 4 threads and       like you've done in this
                   SSE vector instructions    class

A diffuse reflectance shader

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv)
    {
      float3 kd;
      kd = myTex.Sample(mySamp, uv);
      kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
      return float4(kd, 1.0);
    }

Shader programming model: fragments are processed independently, but there is no explicit parallel programming. There is an independent logical sequence of control per fragment.

[Image: "Big Guy, lookin' diffuse"]

Compile shader
The compiled shader takes 1 unshaded fragment input record and produces 1 shaded fragment output record.

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)

Execute shader
A simple core consists of Fetch/Decode, an ALU (Execute), and an Execution Context. It steps through the compiled <diffuseShader> instructions one at a time.

"CPU-style" cores
Fetch/Decode, ALU (Execute), and Execution Context, plus:
  • Data cache (a big one)
  • Out-of-order control logic
  • Fancy branch predictor
  • Memory pre-fetcher

Slimming down
Idea #1: Remove components that help a single instruction stream run fast.

Two cores (two fragments in parallel)
Fragment 1 and fragment 2 each run <diffuseShader> on their own core (Fetch/Decode, ALU, Execution Context).

Four cores (four fragments in parallel)

Sixteen cores (sixteen fragments in parallel)
16 cores = 16 simultaneous instruction streams

Instruction stream sharing
But ... many fragments should be able to share an instruction stream!

Recall: simple processing core
Fetch/Decode, ALU (Execute), Execution Context

Add ALUs
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs (SIMD processing).
[Diagram: one Fetch/Decode unit, ALU 1 to ALU 8, eight Ctx blocks, Shared Ctx Data]

Modifying the shader
Original compiled shader (<diffuseShader>, listed above): processes one fragment using scalar ops on scalar registers.
New compiled shader: processes eight fragments using vector ops on vector registers. Fragments 1 to 8 map to ALU 1 to ALU 8.

    <VEC8_diffuseShader>:
    VEC8_sample vec_r0, vec_v4, t0, vec_s0
    VEC8_mul    vec_r3, vec_v0, cb0[0]
    VEC8_madd   vec_r3, vec_v1, cb0[1], vec_r3
    VEC8_madd   vec_r3, vec_v2, cb0[2], vec_r3
    VEC8_clmp   vec_r3, vec_r3, l(0.0), l(1.0)
    VEC8_mul    vec_o0, vec_r0, vec_r3
    VEC8_mul    vec_o1, vec_r1, vec_r3
    VEC8_mul    vec_o2, vec_r2, vec_r3
    VEC8_mov    o3, l(1.0)

128 fragments in parallel
16 cores = 128 ALUs, 16 simultaneous instruction streams

128 [vertices / primitives / fragments / OpenCL work items / CUDA threads] in parallel

But what about branches?
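The step from the scalar shader to the VEC8 shader can be emulated in ordinary code. The following is a rough sketch under stated assumptions, not how any real GPU or shader compiler is implemented: the helper `vec8`, the function names, and the sample inputs are hypothetical, and texture sampling is replaced by passing in the `kd` color directly. Each call to `vec8` stands for one instruction that is fetched and decoded once and then executed on all eight lanes.

```python
# Hypothetical emulation of scalar vs. 8-wide SIMD execution of the diffuse
# shader. Assumption: the texture sample is replaced by a passed-in kd color.

LANES = 8  # eight ALUs share one instruction stream, as in the VEC8 shader


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def diffuse_scalar(kd, norm, light_dir):
    """One fragment: scalar ops on scalar registers."""
    r3 = norm[0] * light_dir[0]                       # mul
    r3 = norm[1] * light_dir[1] + r3                  # madd
    r3 = norm[2] * light_dir[2] + r3                  # madd
    r3 = clamp(r3, 0.0, 1.0)                          # clmp
    return (kd[0] * r3, kd[1] * r3, kd[2] * r3, 1.0)  # mul x3, mov


def vec8(op, *operands):
    """Apply one operation to every lane of its vector operands."""
    return [op(*(o[i] for o in operands)) for i in range(LANES)]


def diffuse_vec8(kds, norms, light_dir):
    """Eight fragments: each statement below models one vector instruction."""
    lx, ly, lz = light_dir  # shared by all lanes, like a constant buffer
    r3 = vec8(lambda n: n[0] * lx, norms)
    r3 = vec8(lambda n, acc: n[1] * ly + acc, norms, r3)
    r3 = vec8(lambda n, acc: n[2] * lz + acc, norms, r3)
    r3 = vec8(lambda v: clamp(v, 0.0, 1.0), r3)
    o0 = vec8(lambda kd, s: kd[0] * s, kds, r3)
    o1 = vec8(lambda kd, s: kd[1] * s, kds, r3)
    o2 = vec8(lambda kd, s: kd[2] * s, kds, r3)
    return [(o0[i], o1[i], o2[i], 1.0) for i in range(LANES)]


# Made-up inputs: four fragments facing the light, four facing away.
light = (0.0, 0.0, 1.0)
norms = [(0.0, 0.0, 1.0)] * 4 + [(0.0, 0.0, -1.0)] * 4
kds = [(0.5, 0.25, 1.0)] * LANES

shaded = diffuse_vec8(kds, norms, light)
reference = [diffuse_scalar(kd, n, light) for kd, n in zip(kds, norms)]
print(shaded == reference)
```

Both versions perform the same arithmetic per fragment; the vector version issues nine instructions for eight fragments instead of nine instructions per fragment. With 16 such cores, 16 x 8 = 128 fragments are in flight at once, which matches the "128 ALUs, 16 simultaneous instruction streams" figure above.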