DRPU: A Programmable Hardware Architecture for Real-Time Ray Tracing of Coherent Dynamic Scenes

Total Pages: 16

File Type: PDF, Size: 1020 KB

DRPU: A Programmable Hardware Architecture for Real-time Ray Tracing of Coherent Dynamic Scenes

Sven Woop
Computer Graphics Group, Saarland University
66123 Saarbrücken, Germany

Dissertation zur Erlangung des Grades des Doktors der Ingenieurwissenschaften (Dr.-Ing.) der Naturwissenschaftlich-Technischen Fakultät I der Universität des Saarlandes
(Dissertation submitted for the degree of Doctor of Engineering (Dr.-Ing.) to the Faculty of Natural Sciences and Technology I of Saarland University)

Betreuender Hochschullehrer / Supervisor:
Prof. Dr.-Ing. Philipp Slusallek, Universität des Saarlandes, Saarbrücken, Germany

Gutachter / Reviewers:
Prof. Dr.-Ing. Philipp Slusallek, Universität des Saarlandes, Saarbrücken, Germany
Prof. Dr. Wolfgang J. Paul, Universität des Saarlandes, Saarbrücken, Germany
Prof. Erik Brunvand, University of Utah, Salt Lake City, USA

Dekan / Dean:
Prof. Dr. Thorsten Herfet, Universität des Saarlandes, Saarbrücken, Germany

Eingereicht am / Thesis submitted on: 19. Dezember 2006 / December 19th, 2006
Datum des Kolloquiums / Date of defense: 18. Juni 2007 / June 18th, 2007
Vorsitzender des Kolloquiums / Chair of the examination committee: Prof. Gert Smolka
Wissenschaftlicher Beisitzer / Academic assessor: Rafal Mantiuk, Ph.D.

Sven Woop
Lehrstuhl für Computergraphik, Geb. E 1 1 / E08
Im Stadtwald, 66123 Saarbrücken, Germany
[email protected]

Version 1.1, June 29, 2007

Abstract

Ray tracing is a rendering technique capable of generating high-quality, photo-realistic images of three-dimensional scenes. Rendering speed was a problem for a long time, but high-performance software implementations have recently made real-time ray tracing possible. Reaching performance levels comparable to rasterization, however, also requires dedicated hardware solutions.

This thesis proposes the DRPU architecture (Dynamic Ray Processing Unit) as the first programmable ray tracing hardware design for coherent dynamic scenes. For programmable shading it contains a Shading Processor that achieves a high level of efficiency through SIMD processing of floating-point vectors, massive multi-threading, and synchronous execution of packets of threads. A dedicated traversal and intersection unit allows for efficient ray casting even in highly dynamic scenes by using B-KD Trees, a kind of Bounding Volume Hierarchy, as the spatial index structure. A Skinning Processor computes dynamic scene changes, and an Update Processor updates the B-KD Tree node bounds after these changes.

A working FPGA prototype, specified in the newly developed hardware description language HWML, is presented; it achieves performance levels comparable to commodity CPUs even though it is clocked at a 50 times lower frequency of 66 MHz. The prototype is also mapped to a 130 nm CMOS ASIC process, which allows precise post-layout performance estimates. These results are then extrapolated to a 90 nm version with hardware complexity similar to current GPUs: with a comparable amount of hardware resources, frame rates of 80 to 280 frames per second would be possible even with complex shading at a resolution of 1024x768. This would form a good basis for game play and other real-time applications.
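To make the role of the B-KD tree and the Update Processor more concrete, the following C++ sketch shows one possible node layout and a bottom-up bounds refit after the geometry has moved. It is an illustration only: the struct layout, the field names, and the convention that leaf children reference triangles directly are assumptions made here, not the thesis' actual hardware data format.

#include <algorithm>
#include <vector>

// Axis-aligned box used only to propagate subtree extents during the refit.
struct AABB {
    float lo[3] = { 1e30f,  1e30f,  1e30f};
    float hi[3] = {-1e30f, -1e30f, -1e30f};
    void merge(const AABB& b) {
        for (int k = 0; k < 3; ++k) {
            lo[k] = std::min(lo[k], b.lo[k]);
            hi[k] = std::max(hi[k], b.hi[k]);
        }
    }
};

// One B-KD tree node: a binary hierarchy whose nodes bound each of their two
// children by a one-dimensional interval along a single split axis.
struct BKDNode {
    int   axis;       // 0 = x, 1 = y, 2 = z
    bool  leaf;       // if true, child[] holds triangle indices
    int   child[2];   // child node indices, or triangle indices in a leaf
    float lower[2];   // lower bound of child i along 'axis'
    float upper[2];   // upper bound of child i along 'axis'
};

// Bottom-up refit after an animation step: the tree topology stays fixed and
// only the per-child intervals are recomputed from the moved triangles. This
// is the kind of work the thesis assigns to its Update Processor.
AABB refit(std::vector<BKDNode>& nodes, const std::vector<AABB>& triBounds, int idx) {
    BKDNode& n = nodes[idx];
    AABB subtree;
    for (int i = 0; i < 2; ++i) {
        AABB c = n.leaf ? triBounds[n.child[i]]
                        : refit(nodes, triBounds, n.child[i]);
        n.lower[i] = c.lo[n.axis];
        n.upper[i] = c.hi[n.axis];
        subtree.merge(c);
    }
    return subtree;
}

Because only the bounds change while the tree topology is reused, such a refit is linear in the number of nodes, which is what makes per-frame updates of animated geometry practical.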
Kurzfassung

Ray tracing is a rendering technique that can produce high-quality, photo-realistic images of three-dimensional scenes. For a long time the low speed of the ray tracing algorithm was a major problem, but in recent years high-performance ray tracing software has even made real-time ray tracing possible. To reach the performance of rasterization hardware, however, dedicated hardware implementations for ray tracing are indispensable.

This thesis presents the DRPU hardware architecture (Dynamic Ray Processing Unit), the first fully programmable ray tracing hardware architecture that supports even highly dynamic scenes. For programmable shading the DRPU contains a Shading Processor that achieves high efficiency through SIMD processing of floating-point vectors, massive multi-threading, and synchronous execution of packets of threads. The use of B-KD trees (a kind of Bounding Volume Hierarchy) together with a dedicated traversal and intersection processor enables efficient tracing of rays even in highly dynamic scenes. Dynamic changes to the scene can be computed efficiently by a Skinning Processor; this requires recomputing the bounds of the index structure, which is done by an optimized Update Processor.

Furthermore, a working FPGA prototype implementation of the architecture is presented, which reaches a performance level similar to today's high-performance CPUs despite its 50 times lower clock frequency of only 66 MHz. The prototype was implemented in the newly developed hardware description language HWML, which also allowed a straightforward port to a 130 nm CMOS process. The place-and-route results are extrapolated to a 90 nm version with hardware complexity similar to today's GPUs, and precise performance estimates are derived. Even with complex shading, such a chip would make frame rates of 80 to 280 frames per second possible at a resolution of 1024x768 pixels. This provides a good basis for computer games and other real-time applications.

Zusammenfassung

In real-time computer graphics, rasterization is the dominant algorithm today, mainly because very efficient hardware implementations exist for it. The basic idea of rasterization is to render the scene by projecting each triangle individually onto the image plane and shading every covered pixel. Because this approach treats triangles independently of one another, complex effects such as shadows or reflections cannot be computed directly; they can only be approximated by inefficient multi-pass techniques.

Ray tracing, in contrast, allows such shading effects to be computed in a physically correct way by tracing light backwards through the scene: by shooting a reflection ray, for example, the light arriving from the reflection direction can be included directly in the color computation. Since shooting rays through the scene is computationally expensive, real-time ray tracing was unthinkable for a long time; in recent years, however, efficient software implementations have made it possible.

This thesis describes the DRPU hardware architecture, which not only makes ray tracing real-time capable but brings it close to rasterization in both performance and functionality. The DRPU allows rays to be shot recursively during fully programmable shading, and it also supports highly dynamic scenes as long as the dynamics exhibit a certain coherence.
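The recursive shooting of rays during shading can be pictured with a small software analogue. The sketch below is plain C++ written for illustration: the function names, the trivial material model, and the stubbed scene intersection are assumptions, not the DRPU's shader instruction set or hardware threading model. It shows the pattern the architecture supports in hardware: a shader that itself shoots a reflection ray and folds the recursively computed color into its own result.

struct Vec3 {
    float x, y, z;
    Vec3 operator+(const Vec3& b) const { return {x + b.x, y + b.y, z + b.z}; }
    Vec3 operator*(float s)       const { return {x * s, y * s, z * s}; }
};
static float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  reflect(const Vec3& d, const Vec3& n) { return d + n * (-2.0f * dot(d, n)); }

struct Ray { Vec3 org, dir; };
struct Hit { bool valid; Vec3 pos, normal; float reflectivity; };

// Stubs standing in for the traversal/intersection units and a material model.
Hit  intersectScene(const Ray&)             { return {false, {}, {}, 0.0f}; }
Vec3 localShading(const Hit& h, const Ray&) { return h.normal * 0.5f + Vec3{0.5f, 0.5f, 0.5f}; }

// A shader that shoots a secondary (reflection) ray and blends its result in.
Vec3 trace(const Ray& ray, int depth) {
    Hit hit = intersectScene(ray);
    if (!hit.valid) return {0.0f, 0.0f, 0.0f};      // background color
    Vec3 color = localShading(hit, ray);
    if (depth > 0 && hit.reflectivity > 0.0f) {
        Ray refl{hit.pos, reflect(ray.dir, hit.normal)};
        color = color + trace(refl, depth - 1) * hit.reflectivity;
    }
    return color;
}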
Chapter 1 gives an introduction to rendering by describing and comparing the rasterization and ray tracing algorithms. Chapter 2 gives an overview of the DRPU architecture, discussing how the individual hardware units work together and describing important concepts such as multi-threading and packet processing of threads.

Chapter 3 deals with dedicated hardware units for shooting rays through dynamic scenes. B-KD trees are described as a new data structure for ray tracing of dynamic scenes; they are a kind of bounding volume hierarchy with one-dimensional bounds. Rays are traversed through this data structure by a Traversal Processor (TP) and intersected with triangles by an intersection processor (GP). These two units make it possible to trace rays through the scene efficiently, one of the most expensive operations in ray tracing. To support dynamic scenes, parts of the scene's B-KD trees have to be updated; this is done by a dedicated Update Processor. It is demonstrated that an FPGA prototype using these hardware units for ray casting is faster than conventional software implementations.

Another important operation in ray tracing is shading, which computes a color at every point where a ray hits the scene. Chapter 4 describes the DRPU's dedicated Shading Processor (SP), whose instruction set is similar to that of fragment programs on today's GPUs. The Shading Processor is considerably more capable, however, since it supports hardware-assisted recursion as well as flexible memory accesses. The recursion in particular is needed to implement recursive ray tracing, which makes ray tracing effects very easy to combine.

The Shading Processor could also be used to compute the dynamic changes of the scene, but it would not be as efficient as a dedicated hardware unit. Chapter 5 therefore describes another special-purpose processor, which implements a widely used skinning algorithm and executes it very efficiently. Since this Skinning Processor reuses many hardware resources that are already present, it adds hardly any extra cost to the DRPU architecture.

Chapter 6 is devoted to HWML (Hardware Meta Language), a purpose-built hardware description library for the functional programming language ML that greatly simplifies the implementation of the DRPU architecture.
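The per-vertex work of such a skinning step can be sketched in a few lines. The C++ fragment below shows matrix-palette (linear blend) skinning, a common choice for the kind of widely used skinning algorithm Chapter 5 refers to; the exact algorithm, the four-bone limit, and all names here are assumptions for illustration rather than the DRPU's specification.

#include <vector>

struct Vec3 { float x, y, z; };
struct Mat4 { float m[4][4]; };   // row-major affine transform of one bone

static Vec3 transformPoint(const Mat4& M, const Vec3& p) {
    return { M.m[0][0] * p.x + M.m[0][1] * p.y + M.m[0][2] * p.z + M.m[0][3],
             M.m[1][0] * p.x + M.m[1][1] * p.y + M.m[1][2] * p.z + M.m[1][3],
             M.m[2][0] * p.x + M.m[2][1] * p.y + M.m[2][2] * p.z + M.m[2][3] };
}

struct SkinnedVertex {
    Vec3  rest;        // position in the rest pose
    int   bone[4];     // indices of up to four influencing bones
    float weight[4];   // blend weights, assumed to sum to one
};

// v_skinned = sum_i weight[i] * (boneMatrix[bone[i]] * v_rest)
Vec3 skin(const SkinnedVertex& v, const std::vector<Mat4>& boneMatrices) {
    Vec3 out{0.0f, 0.0f, 0.0f};
    for (int i = 0; i < 4; ++i) {
        Vec3 p = transformPoint(boneMatrices[v.bone[i]], v.rest);
        out.x += v.weight[i] * p.x;
        out.y += v.weight[i] * p.y;
        out.z += v.weight[i] * p.z;
    }
    return out;
}

After the skinned vertex positions have been written, the bounds of the B-KD trees are refitted by the Update Processor, as sketched earlier.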
Recommended publications
  • Wildcat Realizm 200 data sheet (3Dlabs)
  • Computation on GPUs: From a Programmable Pipeline to an Efficient Stream Processor (Comba, Dietrich, Pagot, Scheidegger)
  • Evolution of the Programmable Graphics Pipeline (lecture slides, Suresh Venkatasubramanian and Joseph Kider)
  • Evolution of the Programmable Graphics Pipeline (CIS 565 lecture slides, Patrick Cozzi, University of Pennsylvania)
  • OpenGL Shading Language Course, Chapter 1: Introduction to GLSL (Jacobo Rodriguez Villar, TyphoonLabs)
  • A Survey of General-Purpose Computation on Graphics Hardware (Owens et al., Eurographics 2005 State of the Art Report)
  • Advanced Computer Graphics: Real-Time Rendering (CSE 163 lecture slides, Ravi Ramamoorthi)
  • All the Pipelines: Journey Through the GPU (Lou Kramer, Developer Technology Engineer, AMD)
  • Graphics and Computing GPUs (John Nickolls and David Kirk, NVIDIA)
  • GPU Architecture (Michael Doggett, ATI)
  • Efficient GPU Rendering of Subdivision Surfaces Using Adaptive Quadtrees (Brainerd et al.)
  • Real-Time High Quality Rendering: Graphics Hardware Pipeline, Reflection and Rendering Equations (CSE 274 lecture slides)